Preface and Warning

These  notes,  which  form  the  first  chapter  of  a  forthcoming  book,  are  intended  to  serve  as  lecture 
notes  on  the  topic  of  large  deviations  and  applications  for  students  whose  background  and  interests 
are  in  applications  which  involve  finite  dimensional  spaces.  Although  narrow  in  their  scope,  these 
notes  present  a  good  deal  of  the  methods  available  for  more  general  situations.  A  glaring  omission 
is  the  method  of  sub-additivity,  which  will  be  discussed  in  another  chapter  in  the  book.  Another 
deficiency of these notes is the sketchy bibliography and historical notes. We hope to correct this in the
book. 

The funny-looking ?? appearing in various places marks references to later chapters. Please
disregard  those. 
Chapter 1


Introduction 


1.1  The  Large  Deviations  Principle 

A large deviations principle characterizes the limiting behavior (as $\epsilon \to 0$) of a family of
probability measures $\{\mu_\epsilon\}$ on $(\mathcal{X}, \mathcal{B}_{\mathcal{X}})$ in terms of a rate function. This
characterization is via ‘tight’ asymptotic upper and lower exponential bounds on the values that
$\mu_\epsilon$ assigns to closed and open subsets of $\mathcal{X}$ (respectively). For that purpose $\mathcal{X}$ should be
a topological space, so that open and closed subsets of $\mathcal{X}$ are well defined concepts, and as usual
only measurable sets (i.e., elements of $\mathcal{B}_{\mathcal{X}}$, the Borel sigma field on $\mathcal{X}$) are of interest.

Definitions: A rate function $I$ is any mapping $I : \mathcal{X} \to [0, \infty]$ such that for any
$\alpha \in [0, \infty)$ the level set $\Psi_I(\alpha) = \{x : I(x) \le \alpha\}$ is a closed subset of $\mathcal{X}$.
A good rate function is a rate function for which all the level sets $\Psi_I(\alpha)$ are compact subsets
of $\mathcal{X}$.

Alternatively, a rate function is any non-negative, lower semicontinuous function on $\mathcal{X}$ (for the
definitions and a proof of this fact see Appendix ??). Throughout, let $\mathcal{D}_I$ denote the set of points
in $\mathcal{X}$ of finite rate, namely $\mathcal{D}_I = \{x : I(x) < \infty\}$.

Definition: We say that $\{\mu_\epsilon\}$ satisfies the large deviations principle with rate function $I$ if,
for all $\Gamma \in \mathcal{B}_{\mathcal{X}}$,

$$-\inf_{x \in \Gamma^o} I(x) \le \liminf_{\epsilon \to 0} \epsilon \log \mu_\epsilon(\Gamma) \le \limsup_{\epsilon \to 0} \epsilon \log \mu_\epsilon(\Gamma) \le -\inf_{x \in \overline{\Gamma}} I(x) \tag{1.1.1}$$

where $\overline{\Gamma}$ is the closure of $\Gamma$, $\Gamma^o$ is the interior of $\Gamma$ and the infimum of $I$ over an empty
set is interpreted throughout as $\infty$.


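Corollary 1.1.1 When $\{\mu_\epsilon\}$ satisfies a large deviations principle with rate function $I$ and
$\Gamma \in \mathcal{B}_{\mathcal{X}}$ is such that

$$\inf_{x \in \Gamma^o} I(x) = \inf_{x \in \overline{\Gamma}} I(x) = I_\Gamma \tag{1.1.2}$$

then

$$\lim_{\epsilon \to 0} \epsilon \log \mu_\epsilon(\Gamma) = -I_\Gamma \tag{1.1.3}$$

Remark: When $I$ is a good rate function and $\overline{\Gamma} \cap \mathcal{D}_I$ is non-empty then there exists at
least one point $x^* \in \overline{\Gamma}$ where $\inf_{x \in \overline{\Gamma}} I(x)$ is achieved.

A set $\Gamma$ which satisfies (1.1.2) is called an $I$ continuity set. In general, a large deviations
principle implies a precise limit as in (1.1.3) only for $I$ continuity sets.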

In particular, points are typically not open subsets of $\mathcal{X}$, so a large deviations principle does
not result in an asymptotic exponential estimate on the probability that $\mu_\epsilon$ assigns to each point
in $\mathcal{X}$. Better results may well be derived on a case by case basis for specific families of measures
$\{\mu_\epsilon\}$ and particular sets. While such results do not fall within our definition of a large deviations
principle, a few illustrative examples are included in this book (see Sections 2.1, 2.6, 2.10).

An  alternative  formulation  of  the  large  deviations  principle  is  as  follows: 

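(a). For any closed set $F \in \mathcal{B}_{\mathcal{X}}$,

$$\limsup_{\epsilon \to 0} \epsilon \log \mu_\epsilon(F) \le -\inf_{x \in F} I(x) \tag{1.1.4}$$

(b). For any open set $G \in \mathcal{B}_{\mathcal{X}}$,

$$\liminf_{\epsilon \to 0} \epsilon \log \mu_\epsilon(G) \ge -\inf_{x \in G} I(x) \tag{1.1.5}$$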

The  inequality  (1.1.4)  is  also  called  the  large  deviations  upper  bound  while  (1.1.5)  is  called  the 
large  deviations  lower  bound.  Thus,  the  large  deviations  principle  corresponds  to  the  scenario  in 
which  both  bounds  hold  with  the  same  rate  function. 

A few observations are now in place. First, observe that since $\mu_\epsilon(\mathcal{X}) = 1$ for any $\epsilon$ it is
necessary that $\inf_{x \in \mathcal{X}} I(x) = 0$ for the upper bound to hold (and when $I$ is a good rate function
then there exists at least one point $x$ at which $I(x) = 0$). Next, observe that the upper bound
trivially holds whenever $\inf_{x \in F} I(x) = 0$ while the lower bound trivially holds whenever
$\inf_{x \in G} I(x) = \infty$. This leads to another alternative formulation of the large deviations principle
which is useful when trying to prove such a principle.

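(a). For any closed set $F \in \mathcal{B}_{\mathcal{X}}$ contained in the complement of $\Psi_I(\alpha)$,

$$\limsup_{\epsilon \to 0} \epsilon \log \mu_\epsilon(F) \le -\alpha \tag{1.1.6}$$

(b). For any $x \in \mathcal{D}_I$ and any open neighborhood $G \in \mathcal{B}_{\mathcal{X}}$ of $x$ in $\mathcal{X}$,

$$\liminf_{\epsilon \to 0} \epsilon \log \mu_\epsilon(G) \ge -I(x) \tag{1.1.7}$$
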


The inequality (1.1.7) reveals the local nature of the lower bound, which need only be proved for
“small open sets”. Moreover, the following indicates that the upper bound may be proved first for
an approximate $I$ functional:

Definition: For any $\delta > 0$ let $I^\delta(x) = I(x) - \delta$ when $x \in \mathcal{D}_I$ and $I^\delta(x) = 1/\delta$ when
$x \in \mathcal{D}_I^c$.

Since for any set $\Gamma$

$$\lim_{\delta \to 0} \inf_{x \in \Gamma} I^\delta(x) = \inf_{x \in \Gamma} I(x) , \tag{1.1.8}$$

it suffices to prove that for any $\delta > 0$ and for any closed set $F$

$$\limsup_{\epsilon \to 0} \epsilon \log \mu_\epsilon(F) \le -\inf_{x \in F} I^\delta(x) \tag{1.1.9}$$

in order to conclude that the upper bound (1.1.4) holds.

Actually, a common technique for proving the existence of a large deviations principle is by
implicitly defining

$$I(x) = \sup_{\{A \in \mathcal{C}^o \,:\, x \in A\}} \mathcal{L}_A \tag{1.1.10}$$

where

$$\mathcal{L}_A = -\liminf_{\epsilon \to 0} \epsilon \log \mu_\epsilon(A) , \tag{1.1.11}$$

and $\mathcal{C}^o \subset \mathcal{B}_{\mathcal{X}}$ is any basis for the topology of $\mathcal{X}$ (namely, any open set $G$ is the union of
sets from $\mathcal{C}^o$).
This definition results in a rate function $I$ for which the lower bound (1.1.5) holds. Moreover,
when $\lim_{\epsilon \to 0} \epsilon \log \mu_\epsilon(A)$ exists for any $A \in \mathcal{C}^o$ then the upper bound (1.1.4) also holds for
any compact $F \in \mathcal{B}_{\mathcal{X}}$ (these statements are proved in Section ??).

This  approach  in  which  the  upper  bound  is  first  proved  directly  for  compact  sets  and  then  its 
validity  extended  to  all  closed  sets  serves  as  a  motivation  for  the  following  definition. 

Definition: A family of probability measures $\{\mu_\epsilon\}$ satisfies weakly the large deviations principle
with rate function $I$ if the upper bound (1.1.4) holds for $F$ compact and the lower bound (1.1.5)
holds.

The  following  auxiliary  exponential  tightness  property  suffices  for  extending  the  weak  large 
deviations  principle  to  a  full  large  deviations  principle  with  a  good  rate  function  (this  is  shown  in 
the  sequel). 

Definition: A family of measures $\{\mu_\epsilon\}$ is exponentially tight if for any $\alpha < \infty$ there exists a
compact set $K_\alpha \in \mathcal{B}_{\mathcal{X}}$ such that

$$\limsup_{\epsilon \to 0} \epsilon \log \mu_\epsilon(K_\alpha^c) < -\alpha \tag{1.1.12}$$

where $K_\alpha^c$ denotes the complement of the set $K_\alpha$.

Remark: The measures $\{\mu_\epsilon\}$ may satisfy a large deviations principle with a good rate function
without being exponentially tight. Beware of this common logical mistake.

The exponential tightness (and the alternative statement of the upper bound (1.1.6)) are applied
in the following lemma for strengthening a weak large deviations principle.

Lemma 1.1.1 Let $\{\mu_\epsilon\}$ be an exponentially tight family of probability measures.

(a). If the upper bound (1.1.4) holds for all compact sets then it also holds for all closed sets.

(b). If the lower bound (1.1.5) holds for all open sets then $I(\cdot)$ is a good rate function.

Thus, when an exponentially tight family of measures satisfies weakly the large deviations principle
with a rate function $I(\cdot)$ then $I$ is a good rate function and a full large deviations principle holds.
Proof: (a). Consider an arbitrary closed set $F$. All we need is to establish (1.1.6) whenever
$F \subset \Psi_I(\alpha)^c$. Fix any such $\alpha < \infty$. Clearly,

$$\mu_\epsilon(F) \le \mu_\epsilon(F \cap K_\alpha) + \mu_\epsilon(K_\alpha^c) ,$$

where $K_\alpha$ is the compact set in (1.1.12). Since the exponential rate of a sum of two non-negative
terms is the larger of their two rates, part (a) of the lemma now follows by the inequality
(1.1.12) and the upper bound (1.1.4) for the compact set $F \cap K_\alpha$ (note that
$F \cap K_\alpha \subset \Psi_I(\alpha)^c$, so $\inf_{x \in F \cap K_\alpha} I(x) \ge \alpha$).

(b). Applying the lower bound (1.1.5) to the open set $K_\alpha^c$, one concludes by (1.1.12) that
$\inf_{x \in K_\alpha^c} I(x) > \alpha$. Therefore, $\Psi_I(\alpha) \subset K_\alpha$, yielding the compactness of the closed level set
$\Psi_I(\alpha)$. As this argument holds for any $\alpha < \infty$, it follows that $I(\cdot)$ is a good rate function. □

A countable family of measures $\mu_n$ is considered in many cases (for example when $\mu_n$ is the law
governing the empirical mean of $n$ random variables). Then, a large deviations principle corresponds
to the statement

$$-\inf_{x \in \Gamma^o} I(x) \le \liminf_{n \to \infty} a_n \log \mu_n(\Gamma) \le \limsup_{n \to \infty} a_n \log \mu_n(\Gamma) \le -\inf_{x \in \overline{\Gamma}} I(x) , \tag{1.1.13}$$

for some sequence $a_n \to 0$. Note that here $a_n$ replaces $\epsilon$ of (1.1.1) and similarly the statements
(1.1.4)-(1.1.7) may be appropriately modified.

1.2  Major  large  deviations  techniques 

The major disadvantage of the very general approach outlined above is that it does not really
reveal what the rate function values are. However, since the rate function associated with a large
deviations principle is, under mild conditions, unique (as proved in Section ??), the following two
step method is quite useful:

(a). Prove the existence of a large deviations principle.

(b). Verify certain properties of the rate function (typically convexity and/or smoothness) and
from them deduce a more convenient (explicit) characterization of this function.

Indeed this method prevails to a certain degree throughout Chapters ??-?? of this book.

Nevertheless, when $\mathcal{X}$ is a subset of a finite dimensional vector space (specifically when
$\mathcal{X} \subset \mathbb{R}^d$ for some $d < \infty$) then one can typically prove directly the existence of a large deviations
principle with an explicit rate function. Chapter 2 of this book, which is dedicated to this class of
problems, serves at least three purposes:

(a). To illustrate the sharpest possible results and emphasize both the reason for the existence of
a large deviations principle and the types of rate functions one expects to find.

(b). To demonstrate the different methods for proving explicit large deviations statements in
simple scenarios when one’s eyesight is not yet obscured by various technical details.

(c). To present an important class of interesting results and their applications while requiring
relatively little mathematical background.
Chapter 2


Large  Deviations  principles  for  finite 
dimensional  spaces 


2.1  Combinatorial  Techniques  for  finite  alphabet 

Throughout this section all random variables assume values in the finite set
$\Sigma = \{a_1, a_2, \ldots, a_{|\Sigma|}\}$ ($\Sigma$ is also sometimes called the underlying alphabet). Combinatorial
methods are then applicable for deriving large deviations principles for empirical measures (see
Sections 2.1.1 and 2.1.3 below) and for empirical means (see Section 2.1.2 below). While the scope of
these methods is limited to finite alphabets, they illustrate the results one can hope to (and should
indeed) derive for more abstract alphabets. Some of these results are actually direct consequences
of the large deviations principles derived below via the combinatorial method. For example, in
Sections ?? and ?? Theorems 2.1.1 and 2.1.3 are proved for a rather abstract alphabet $\Sigma$
(specifically, for any Hausdorff topological space $\Sigma$). The combinatorial methods, unlike all other
approaches for deriving large deviations principles, are based upon point estimates. For example,
they bound the probabilities associated with each possible outcome of the empirical measure (also
denoted as type, see the definition below). This turns out to be very useful for some statistical
applications (an example is presented in Section 2.6).
2.1.1 The method of types and Sanov’s theorem

Let $X_1, \ldots, X_n$ be a sequence of random variables (r.v.) which are independent, identically
distributed (i.i.d.) according to the law $\mu \in M_1(\Sigma)$. Throughout, $M_1(\Sigma)$ denotes the space of all
probability measures (laws) on the alphabet $\Sigma$. Typically we shall specify the topology with which
$M_1(\Sigma)$ is equipped. However, recall that here $\Sigma$ is a finite set, so $M_1(\Sigma)$ is identified with the
standard probability simplex in $\mathbb{R}^{|\Sigma|}$, i.e., the set of all real valued vectors with $|\Sigma|$ non-negative
components which sum to 1, and open sets in $M_1(\Sigma)$ are induced by open sets in $\mathbb{R}^{|\Sigma|}$.

Let $\Sigma_\mu$ denote the support of the law $\mu$, i.e., the set $\Sigma_\mu = \{a_i : \mu(a_i) > 0\}$. In general, $\Sigma_\mu$ may
be a strict subset of $\Sigma$ (i.e., $\mu$ may assign zero probability to some of the symbols in $\Sigma$). However,
when considering one underlying measure $\mu$ we may without loss of generality (w.l.o.g.) reduce $\Sigma$
to $\Sigma_\mu$ by ignoring all symbols which appear with zero probability. This indeed is implicitly assumed
throughout this Section (in Section 2.6 below we encounter a scenario where we have to keep track
of various support sets of the form of $\Sigma_\mu$).

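Definition 2.1.1 The type $L_n^{\mathbf{x}}$ of a sequence $\mathbf{x} = (x_1, \ldots, x_n)$ is the empirical measure (law)
associated with this sequence. Specifically, $L_n^{\mathbf{x}} = (L_n^{\mathbf{x}}(a_1), \ldots, L_n^{\mathbf{x}}(a_{|\Sigma|}))$ is an element of
$M_1(\Sigma)$, where

$$L_n^{\mathbf{x}}(a_i) = \frac{1}{n} \sum_{j=1}^{n} 1_{a_i}(x_j) = \frac{1}{n} N(a_i|\mathbf{x}) , \quad i = 1, \ldots, |\Sigma| \tag{2.1.1}$$

and $N(a_i|\mathbf{x})$ is the number of occurrences of the symbol $a_i$ in $\mathbf{x}$.

Let $\mathcal{L}_n$ denote the set of all possible types of sequences of length $n$. Thus,
$\mathcal{L}_n = \{\nu : \nu = L_n^{\mathbf{x}} \text{ for some } \mathbf{x}\} \subset \mathbb{R}^{|\Sigma|}$. Note that the random empirical measure $L_n^{\mathbf{X}}$
associated with the sequence $\mathbf{X} = (X_1, \ldots, X_n)$ must be an element of the set $\mathcal{L}_n$.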

The  usefulness  of  the  notion  of  types  for  finite  alphabets  as  well  as  the  reason  why  it  is  not 
readily  extended  to  more  general  alphabets  are  due  to  the  following  ‘volume’  and  ‘approximation 
distance’  estimates: 

Lemma 2.1.1 (a). $|\mathcal{L}_n| \le (n + 1)^{|\Sigma|}$.

(b). For any probability vector $\nu \in M_1(\Sigma)$

$$d_V(\nu, \mathcal{L}_n) = \inf_{\nu' \in \mathcal{L}_n} d_V(\nu, \nu') \le \frac{|\Sigma|}{2n} \tag{2.1.2}$$

where $d_V(\nu, \nu') = \sup_{A \subset \Sigma} [\nu(A) - \nu'(A)]$ is the variational distance between the measures $\nu$
and $\nu'$.

Proof: Note that any component of the vector $L_n^{\mathbf{x}}$ belongs to the set $\{\frac{0}{n}, \frac{1}{n}, \ldots, \frac{n}{n}\}$ whose
cardinality (size) is $(n + 1)$. Part (a) of the lemma follows since the vector $L_n^{\mathbf{x}}$ is specified by $|\Sigma|$
such quantities, each of which assumes at most $(n + 1)$ distinct values.

To prove part (b) observe that $\mathcal{L}_n$ indeed contains all probability vectors which are composed
of $|\Sigma|$ elements from the set $\{\frac{0}{n}, \frac{1}{n}, \ldots, \frac{n}{n}\}$. Thus, for any vector $\nu \in M_1(\Sigma)$ there exists a
vector $\nu' \in \mathcal{L}_n$ with $|\nu(a_i) - \nu'(a_i)| \le \frac{1}{n}$ for $i = 1, \ldots, |\Sigma|$. The bound of (2.1.2) now follows
since for any discrete alphabet

$$d_V(\nu, \nu') = \frac{1}{2} \sum_{i=1}^{|\Sigma|} |\nu(a_i) - \nu'(a_i)| .$$

□
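
The counting argument of Lemma 2.1.1 is easy to check numerically for small $n$. The following Python sketch enumerates the set of types for a three-letter alphabet and compares its size, and its variational distance from one fixed probability vector, with the two bounds of the lemma. The function names `types` and `variational_distance`, the choice $n = 6$, $|\Sigma| = 3$ and the vector `target` are illustrative assumptions, not part of these notes.

```python
from math import comb

def types(n, k):
    """All types of length-n sequences over a k-letter alphabet, as tuples of counts."""
    def rec(remaining, slots):
        if slots == 1:
            yield (remaining,)
            return
        for c in range(remaining + 1):
            for rest in rec(remaining - c, slots - 1):
                yield (c,) + rest
    return list(rec(n, k))

def variational_distance(nu, nu2):
    """Variational distance on a finite alphabet: half the l1 distance."""
    return 0.5 * sum(abs(p - q) for p, q in zip(nu, nu2))

n, k = 6, 3                       # sequence length and alphabet size (assumed values)
all_types = types(n, k)
num_types = len(all_types)        # exact size of the set of types
target = (0.37, 0.21, 0.42)       # an arbitrary probability vector (assumption)
dist = min(variational_distance(target, [c / n for c in t]) for t in all_types)

print(num_types, (n + 1) ** k)    # exact count versus the bound of part (a)
print(dist, k / (2 * n))          # distance to the nearest type versus the bound of part (b)
```

The exact count is the number of ways to split $n$ into $|\Sigma|$ non-negative integers, so the polynomial bound of part (a) is far from tight, but it is all that the large deviations estimates below require.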


Remarks: 

1. Actually, since $L_n^{\mathbf{x}}$ is a probability vector, at most $|\Sigma| - 1$ such quantities should be specified
and so $|\mathcal{L}_n| \le (n + 1)^{|\Sigma| - 1}$.

2. Lemma 2.1.1 states that the volume of $\mathcal{L}_n$, the support of the random empirical measures $L_n^{\mathbf{X}}$,
grows polynomially in $n$ and further that for large enough $n$, the set $\mathcal{L}_n$ approximates uniformly
and arbitrarily well (in the sense of variational distance) any measure in $M_1(\Sigma)$. Both properties
are invalid when $|\Sigma| = \infty$.

Definition 2.1.2 The type (composition) class $T(\nu)$ of a probability law $\nu \in \mathcal{L}_n$ is the set
$T(\nu) = \{\mathbf{x} \in \Sigma^n : L_n^{\mathbf{x}} = \nu\}$.


Outcomes $\mathbf{x}$ in the same type class are equally likely when $X_1, \ldots, X_n$ are i.i.d. random variables.
This is part of the first among the three lemmas in the sequel which yield a strong prelude to
large deviations results by estimating the exponential growth of each type class and the precise
probability of any specific outcome from a given type class. The following definitions are useful for
that purpose:

Definitions: 
(1). The entropy of a probability vector $\nu$ is $H(\nu) = -\sum_{i=1}^{|\Sigma|} \nu(a_i) \log \nu(a_i)$, where
$0 \log 0 = 0$ throughout.

(2). The (relative) entropy of a probability vector $\nu$ relative to $\mu$ is

$$H(\nu|\mu) = \sum_{a_i \in \Sigma_\nu} \nu(a_i) \log \frac{\nu(a_i)}{\mu(a_i)}$$

when $\Sigma_\nu \subset \Sigma_\mu$ (in which case $H(\nu|\mu) < \infty$). Otherwise $H(\nu|\mu) = \infty$ (whenever $\nu(a_i) > 0$
while $\mu(a_i) = 0$ for some $a_i \in \Sigma$).

Remark: Note that $H(\cdot|\mu)$ is a rate function. Indeed, the set $M_1(\Sigma_\mu)$ is a closed subset of $M_1(\Sigma)$ in
which $H(\cdot|\mu)$ is a continuous function (note that the function $-x \log x$ is continuous on $[0, 1]$), while
$H(\cdot|\mu) = \infty$ outside $M_1(\Sigma_\mu)$. Finally, the non-negativity of $H(\cdot|\mu)$ follows by Jensen’s inequality.
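
The two definitions above translate directly into code. The following Python sketch is one possible rendering, with the conventions $0 \log 0 = 0$ and $H(\nu|\mu) = \infty$ when the support of $\nu$ is not contained in that of $\mu$; the function names `entropy` and `relative_entropy` and the example vectors are illustrative assumptions.

```python
from math import log, inf

def entropy(nu):
    """H(nu) = -sum nu(a_i) log nu(a_i), with 0 log 0 = 0."""
    return -sum(p * log(p) for p in nu if p > 0)

def relative_entropy(nu, mu):
    """H(nu|mu); infinite when nu charges a symbol to which mu assigns zero mass."""
    total = 0.0
    for p, q in zip(nu, mu):
        if p > 0:
            if q == 0:
                return inf
            total += p * log(p / q)
    return total

mu = (0.5, 0.3, 0.2)                        # example law (assumption)
print(entropy(mu))                          # entropy of mu
print(relative_entropy((0.2, 0.3, 0.5), mu))
print(relative_entropy((0.5, 0.5), (1.0, 0.0)))   # support mismatch gives inf
```

Evaluating `relative_entropy(mu, mu)` returns zero, in agreement with the fact that the rate function vanishes at the underlying law.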


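Lemma 2.1.2 If $\mathbf{x} \in T(\nu)$ for $\nu \in \mathcal{L}_n$, then

$$\mathrm{Prob}_\mu((X_1, \ldots, X_n) = \mathbf{x}) = e^{-n[H(\nu) + H(\nu|\mu)]} \tag{2.1.3}$$

where $\mathrm{Prob}_\mu$ corresponds to the probability law associated with an infinite sequence of random
variables $\{X_j\}$ which are independent and identically distributed according to $\mu \in M_1(\Sigma)$.


Proof: Clearly (2.1.3) holds when $H(\nu|\mu) = \infty$ as the random empirical measure $L_n^{\mathbf{X}}$ concentrates
on types $\nu \in \mathcal{L}_n$ for which $\Sigma_\nu \subset \Sigma_\mu$ (i.e., $H(\nu|\mu) < \infty$). We may thus assume that $H(\nu|\mu) < \infty$
and $\Sigma_\nu \subset \Sigma_\mu$.

Since $L_n^{\mathbf{x}} = \nu$ and $H(\nu) + H(\nu|\mu) = -\sum_{i=1}^{|\Sigma|} \nu(a_i) \log \mu(a_i)$ it follows that

$$\mathrm{Prob}_\mu((X_1, \ldots, X_n) = \mathbf{x}) = \prod_{i=1}^{|\Sigma|} \mu(a_i)^{N(a_i|\mathbf{x})} = \prod_{i=1}^{|\Sigma|} \mu(a_i)^{n\nu(a_i)} = e^{-n[H(\nu) + H(\nu|\mu)]} \tag{2.1.4}$$

□


In particular, since $H(\mu|\mu) = 0$, when $\mu \in \mathcal{L}_n$ and $\mathbf{x} \in T(\mu)$ then

$$\mathrm{Prob}_\mu((X_1, \ldots, X_n) = \mathbf{x}) = e^{-nH(\mu)} \tag{2.1.5}$$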
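
The identity (2.1.3) can be verified on a concrete sequence. The Python sketch below computes the probability of one fixed outcome directly as a product of symbol probabilities and again through the entropy of its type. The alphabet, the law `mu` and the sequence `x` are arbitrary choices made for illustration only.

```python
from collections import Counter
from math import exp, log, prod

alphabet = ("a", "b", "c")
mu = {"a": 0.5, "b": 0.3, "c": 0.2}        # underlying law (assumption)
x = "abacabbaca"                            # one fixed outcome of length n = 10 (assumption)
n = len(x)

counts = Counter(x)
nu = {s: counts[s] / n for s in alphabet}   # the type of x

# Direct computation: product of the i.i.d. symbol probabilities.
direct = prod(mu[s] for s in x)

# Computation through the type: exp(-n [H(nu) + H(nu|mu)]).
H = -sum(p * log(p) for p in nu.values() if p > 0)
H_rel = sum(p * log(p / mu[s]) for s, p in nu.items() if p > 0)
via_entropy = exp(-n * (H + H_rel))

print(direct, via_entropy)                  # the two values agree up to rounding
```

Any permutation of `x` has the same type and hence the same probability, which is the statement that outcomes in one type class are equally likely.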


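Lemma 2.1.3 For any $\nu \in \mathcal{L}_n$, $(n + 1)^{-|\Sigma|} e^{nH(\nu)} \le |T(\nu)| \le e^{nH(\nu)}$.

Remark: Since $|T(\nu)| = \binom{n}{n\nu(a_1), \ldots, n\nu(a_{|\Sigma|})}$ is a multinomial coefficient, one might get a
good estimate of $|T(\nu)|$ by applying Stirling’s approximation (see [12], pg 48). We take here a
somewhat different route, of a more information theoretic flavor.

Proof: Any type class has probability at most one and all its members are of equal probability.
Therefore, for any $\nu \in \mathcal{L}_n$, by (2.1.5),

$$1 \ge \mathrm{Prob}_\nu(L_n^{\mathbf{X}} = \nu) = \mathrm{Prob}_\nu(\mathbf{X} \in T(\nu)) = |T(\nu)| e^{-nH(\nu)} \tag{2.1.6}$$

and the upper bound on $|T(\nu)|$ follows. The lower bound follows from the inequality (which is
proved below)

$$\mathrm{Prob}_\nu(L_n^{\mathbf{X}} = \nu) \ge \mathrm{Prob}_\nu(L_n^{\mathbf{X}} = \nu') , \quad \forall \nu', \nu \in \mathcal{L}_n , \tag{2.1.7}$$

as thus

$$1 = \sum_{\nu' \in \mathcal{L}_n} \mathrm{Prob}_\nu(L_n^{\mathbf{X}} = \nu') \le |\mathcal{L}_n| \, \mathrm{Prob}_\nu(L_n^{\mathbf{X}} = \nu) = |\mathcal{L}_n| \, e^{-nH(\nu)} |T(\nu)|$$

while $|\mathcal{L}_n| \le (n + 1)^{|\Sigma|}$ by Lemma 2.1.1.

It suffices to prove the inequality (2.1.7) for $\nu' \in \mathcal{L}_n$ such that $\Sigma_{\nu'} \subset \Sigma_\nu$ (as otherwise
$\mathrm{Prob}_\nu(L_n^{\mathbf{X}} = \nu') = 0$). Thus, without loss of generality one may assume that $\Sigma = \Sigma_\nu$. Now,
consider the ratio

$$\frac{\mathrm{Prob}_\nu(L_n^{\mathbf{X}} = \nu)}{\mathrm{Prob}_\nu(L_n^{\mathbf{X}} = \nu')} = \frac{\binom{n}{n\nu(a_1), \ldots, n\nu(a_{|\Sigma|})} \prod_{i=1}^{|\Sigma|} \nu(a_i)^{n\nu(a_i)}}{\binom{n}{n\nu'(a_1), \ldots, n\nu'(a_{|\Sigma|})} \prod_{i=1}^{|\Sigma|} \nu(a_i)^{n\nu'(a_i)}} = \prod_{i=1}^{|\Sigma|} \frac{(n\nu'(a_i))!}{(n\nu(a_i))!} \, \nu(a_i)^{n\nu(a_i) - n\nu'(a_i)} \tag{2.1.8}$$

This last expression is a product of terms of the form $\frac{m!}{l!} \left(\frac{l}{n}\right)^{l - m}$. By induction
$\frac{m!}{l!} \ge l^{m - l}$ for any $m, l \in \mathbb{Z}^+$, and thus (2.1.8) yields

$$\frac{\mathrm{Prob}_\nu(L_n^{\mathbf{X}} = \nu)}{\mathrm{Prob}_\nu(L_n^{\mathbf{X}} = \nu')} \ge \prod_{i=1}^{|\Sigma|} n^{n\nu'(a_i) - n\nu(a_i)} = n^{n\left[\sum_{i=1}^{|\Sigma|} \nu'(a_i) - \sum_{i=1}^{|\Sigma|} \nu(a_i)\right]} = 1 \tag{2.1.9}$$

□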
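
The two-sided bound of Lemma 2.1.3 can be checked against the exact multinomial coefficient. The Python sketch below does so for every type of length $n = 10$ over a three-letter alphabet; the helper names `type_class_size` and `entropy_of_counts` and the chosen sizes are assumptions made for the illustration.

```python
from math import exp, factorial, log

def type_class_size(counts):
    """|T(nu)| as the multinomial coefficient n! / (c_1! ... c_k!)."""
    size = factorial(sum(counts))
    for c in counts:
        size //= factorial(c)
    return size

def entropy_of_counts(counts):
    """Entropy H(nu) of the type nu = counts / n."""
    n = sum(counts)
    return -sum((c / n) * log(c / n) for c in counts if c > 0)

n, k = 10, 3                                   # assumed sequence length and alphabet size
example = type_class_size((5, 3, 2))           # 10! / (5! 3! 2!)

bounds_hold = True
for c1 in range(n + 1):
    for c2 in range(n + 1 - c1):
        counts = (c1, c2, n - c1 - c2)
        upper = exp(n * entropy_of_counts(counts))
        lower = upper / (n + 1) ** k
        size = type_class_size(counts)
        # small relative slack guards against floating point rounding at H = 0
        bounds_hold = bounds_hold and lower <= size <= upper * (1 + 1e-9)

print(example, bounds_hold)
```

The loop covers the degenerate types as well, for which $H(\nu) = 0$ and the type class consists of a single sequence.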
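Lemma 2.1.4 (Large deviations probabilities) For any $\nu \in \mathcal{L}_n$

$$(n + 1)^{-|\Sigma|} e^{-nH(\nu|\mu)} \le \mathrm{Prob}_\mu(L_n^{\mathbf{X}} = \nu) \le e^{-nH(\nu|\mu)} \tag{2.1.10}$$

Proof: By Lemma 2.1.2

$$\mathrm{Prob}_\mu(L_n^{\mathbf{X}} = \nu) = |T(\nu)| \, \mathrm{Prob}_\mu((X_1, \ldots, X_n) = \mathbf{x}, \; L_n^{\mathbf{x}} = \nu) = |T(\nu)| \, e^{-n[H(\nu) + H(\nu|\mu)]} . \tag{2.1.11}$$

The proof is now completed by applying Lemma 2.1.3. □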

Combining Lemma 2.1.1 and Lemma 2.1.4 we prove Sanov’s theorem for the finite alphabet $\Sigma$,
our first large deviations principle. Later on, in Sections ?? and ??, we shall extend this result
to an appropriate class of topological spaces covering in particular the important case of
$\Sigma = \mathbb{R}^d$.

Sanov’s theorem deals with the sequence of laws governing the random empirical measure $L_n^{\mathbf{X}}$,
which is a random element of the probability simplex $M_1(\Sigma)$ when the underlying independent
random variables are identically distributed according to the law $\mu$.

Theorem 2.1.1 (Sanov’s theorem) For any set $\Gamma$ of probability vectors in $M_1(\Sigma) \subset \mathbb{R}^{|\Sigma|}$

$$-\inf_{\nu \in \Gamma^o} H(\nu|\mu) \le \liminf_{n \to \infty} \frac{1}{n} \log \mathrm{Prob}_\mu(L_n^{\mathbf{X}} \in \Gamma) \le \limsup_{n \to \infty} \frac{1}{n} \log \mathrm{Prob}_\mu(L_n^{\mathbf{X}} \in \Gamma) \le -\inf_{\nu \in \Gamma} H(\nu|\mu) , \tag{2.1.12}$$

where $\Gamma^o$ is the interior of $\Gamma$ (as a set in $M_1(\Sigma) \subset \mathbb{R}^{|\Sigma|}$).

Remark: Comparing (2.1.12) and the definition (1.1.13) we conclude that Sanov’s theorem states
that the family of laws $\mathrm{Prob}_\mu(L_n^{\mathbf{X}} \in \cdot)$ satisfies a large deviations principle with the rate function
$H(\cdot|\mu)$. Further, in the particular case of finite alphabet $\Sigma$ covered here the upper bound holds for
any set $\Gamma$ (i.e., there is no need for a closure operation). For a few other improvements over (2.1.12)
which are specific to this case see exercises 2.1.1, 2.1.2 and 2.1.3. Note however that there are closed
sets for which the upper and lower bounds of (2.1.12) are far apart (see exercise 2.1.5). Moreover,
there are closed sets for which the limit of $\frac{1}{n} \log \mathrm{Prob}_\mu(L_n^{\mathbf{X}} \in \Gamma)$ does not exist at all (see exercise
2.1.4).
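Proof: We begin by deducing from Lemma 2.1.4 upper and lower bounds for any finite $n$. The
upper bound

$$\mathrm{Prob}_\mu(L_n^{\mathbf{X}} \in \Gamma) = \sum_{\nu \in \Gamma \cap \mathcal{L}_n} \mathrm{Prob}_\mu(L_n^{\mathbf{X}} = \nu) \le \sum_{\nu \in \Gamma \cap \mathcal{L}_n} e^{-nH(\nu|\mu)} \le |\Gamma \cap \mathcal{L}_n| \, e^{-n \inf_{\nu \in \Gamma \cap \mathcal{L}_n} H(\nu|\mu)} \le (n + 1)^{|\Sigma|} e^{-n \inf_{\nu \in \Gamma \cap \mathcal{L}_n} H(\nu|\mu)} \tag{2.1.13}$$

follows by the union of events bound and the upper bound of Lemma 2.1.4. The accompanying
lower bound is

$$\mathrm{Prob}_\mu(L_n^{\mathbf{X}} \in \Gamma) = \sum_{\nu \in \Gamma \cap \mathcal{L}_n} \mathrm{Prob}_\mu(L_n^{\mathbf{X}} = \nu) \ge \sum_{\nu \in \Gamma \cap \mathcal{L}_n} (n + 1)^{-|\Sigma|} e^{-nH(\nu|\mu)} \ge (n + 1)^{-|\Sigma|} e^{-n \inf_{\nu \in \Gamma \cap \mathcal{L}_n} H(\nu|\mu)} . \tag{2.1.14}$$

As $\lim_{n \to \infty} \frac{1}{n} \log (n + 1)^{|\Sigma|} = 0$, the normalized logarithmic limits of (2.1.13) and (2.1.14) yield

$$\limsup_{n \to \infty} \frac{1}{n} \log \mathrm{Prob}_\mu(L_n^{\mathbf{X}} \in \Gamma) = -\liminf_{n \to \infty} \left\{ \inf_{\nu \in \Gamma \cap \mathcal{L}_n} H(\nu|\mu) \right\} \tag{2.1.15}$$

and

$$\liminf_{n \to \infty} \frac{1}{n} \log \mathrm{Prob}_\mu(L_n^{\mathbf{X}} \in \Gamma) = -\limsup_{n \to \infty} \left\{ \inf_{\nu \in \Gamma \cap \mathcal{L}_n} H(\nu|\mu) \right\} . \tag{2.1.16}$$

The upper bound of (2.1.12) follows since $\Gamma \cap \mathcal{L}_n \subset \Gamma$ for any $n$.

Turning now to the lower bound of (2.1.12), fix an arbitrary point $\nu \in \Gamma^o$. The set
$\{\nu' : d_V(\nu, \nu') < \delta\}$ must be contained in $\Gamma$ for all $\delta > 0$ sufficiently small (as $\nu$ is in the
interior of $\Gamma$). Thus, by part (b) of Lemma 2.1.1 (see (2.1.2)) there exists a sequence $\nu_n \in \Gamma \cap \mathcal{L}_n$
such that $\nu_n \to \nu$ as $n \to \infty$. As already observed, $\Sigma$ may be identified with $\Sigma_\mu$ without loss of
generality. Then $H(\cdot|\mu)$ is a continuous function and therefore,

$$-\limsup_{n \to \infty} \inf_{\nu' \in \Gamma \cap \mathcal{L}_n} H(\nu'|\mu) \ge -\lim_{n \to \infty} H(\nu_n|\mu) = -H(\nu|\mu) . \tag{2.1.17}$$

The lower bound of (2.1.12) follows by taking the infimum over $\nu \in \Gamma^o$ in (2.1.17). □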
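
Sanov’s theorem can be observed numerically on a binary alphabet, where the probability of a set of types is an exact binomial sum. The Python sketch below assumes $\mu = (1/2, 1/2)$ and the set $\Gamma = \{\nu : \nu(a_1) \ge 0.7\}$, and compares the normalized logarithm of the probability with $-\inf_{\nu \in \Gamma} H(\nu|\mu) = -H((0.7, 0.3)|\mu)$. The set $\Gamma$, the law and the values of $n$ are illustrative assumptions, and the function names are hypothetical.

```python
from math import comb, log

def sanov_rate(p, q):
    """H(nu|mu) for nu = (p, 1-p) and mu = (q, 1-q) on a binary alphabet."""
    total = 0.0
    for a, b in ((p, q), (1.0 - p, 1.0 - q)):
        if a > 0:
            total += a * log(a / b)
    return total

def normalized_log_prob(n, num=7, den=10):
    """(1/n) log Prob(L_n(a_1) >= num/den) for n fair coin flips, computed exactly."""
    k_min = -(-num * n // den)                 # ceil(num * n / den) in integer arithmetic
    favourable = sum(comb(n, k) for k in range(k_min, n + 1))
    return (log(favourable) - n * log(2)) / n

rate = sanov_rate(0.7, 0.5)
results = {n: normalized_log_prob(n) for n in (50, 200, 800)}

for n, value in results.items():
    # by (2.1.13) and (2.1.14) the gap is at most |Sigma| log(n+1) / n with |Sigma| = 2
    print(n, value, -rate, 2 * log(n + 1) / n)
```

For these values of $n$ the point $(0.7, 0.3)$ is itself a type, so the finite-$n$ bounds (2.1.13) and (2.1.14) apply with the limiting infimum, and the gap shrinks at the rate $\log(n + 1)/n$.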

Exercises:

2.1.1 Prove that for any open set $\Gamma$

$$-\lim_{n \to \infty} \left\{ \inf_{\nu \in \Gamma \cap \mathcal{L}_n} H(\nu|\mu) \right\} = \lim_{n \to \infty} \frac{1}{n} \log \mathrm{Prob}_\mu(L_n^{\mathbf{X}} \in \Gamma) = -\inf_{\nu \in \Gamma} H(\nu|\mu) = -I_\Gamma . \tag{2.1.18}$$

2.1.2 (a). Extend the conclusions of exercise 2.1.1 to any set $\Gamma \subset M_1(\Sigma)$ which is contained
in the closure of its interior.

(b). Prove that for any such set $I_\Gamma < \infty$ and $I_\Gamma = H(\nu^*|\mu)$ for some $\nu^* \in \overline{\Gamma^o}$.

Hint: Use the continuity of $H(\cdot|\mu)$ on $\overline{\Gamma^o}$ and the compactness of this set.

2.1.3 Assume that $\Gamma$ is a convex subset of $M_1(\Sigma)$ of non-empty interior. Prove that all the
conclusions of exercise 2.1.2 apply. Moreover, prove that here the point $\nu^* \in \overline{\Gamma^o}$ for which
$I_\Gamma = H(\nu^*|\mu)$ is unique.

Hint: Show that for any $\nu \in \Gamma$ and any $\nu' \in \Gamma^o$ the line segment connecting $\nu$ and $\nu'$ is part of
$\overline{\Gamma^o}$. Deduce that $\Gamma \subset \overline{\Gamma^o}$, which is a closed convex set. Then prove that $H(\cdot|\mu)$ is a strictly
convex function in $M_1(\Sigma)$.

2.1.4 Find a closed set $\Gamma$ for which both of the limits in (2.1.18) do not exist.

Hint: Any set $\Gamma = \{\nu\}$ consisting of one probability vector $\nu \in M_1(\Sigma)$ which also belongs to $\mathcal{L}_n$
(for some $n$) shall do.

2.1.5 Find a closed set $\Gamma$ such that $\Gamma = \overline{\Gamma^o}$ and $\inf_{\nu \in \Gamma} H(\nu|\mu) < \infty$ while
$\inf_{\nu \in \Gamma^o} H(\nu|\mu) = \infty$.

Hint;  For  this  construction  you  need  ^  S.  Try  F  =  {;/  :  +  v'i'(ai)  >  1}  and  =  0 

(while  1E|  =  3). 

2.1.2 Cramér’s theorem for finite alphabets in $\mathbb{R}^1$

As an application of Sanov’s theorem, we present a proof of a version of Cramér’s theorem about the
large deviations of the empirical mean of i.i.d. random variables. For that purpose we further assume
throughout this section that $\Sigma = \{a_1, \ldots, a_{|\Sigma|}\}$ is a finite subset of $\mathbb{R}^1$ and let
$S_n = \frac{1}{n} \sum_{j=1}^{n} X_j$ denote the resulting sequence of empirical means (where $X_j \in \Sigma$ as in
Section 2.1.1 above). Cramér’s theorem deals with the large deviations principle satisfied by the
family of laws governing the real valued random variables $S_n$. Sections 2.2 and 2.3 are devoted to
successive generalizations of this result to $\Sigma = \mathbb{R}^d$ (Section 2.2) and to weakly dependent random
vectors in $\mathbb{R}^d$ (Section 2.3).

Note that in the case considered here ($|\Sigma| < \infty$), the random variables $S_n$ assume values in the
compact interval $K = [\min_{1 \le i \le |\Sigma|} \{a_i\}, \max_{1 \le i \le |\Sigma|} \{a_i\}]$. Moreover,
$S_n = \sum_{i=1}^{|\Sigma|} a_i L_n^{\mathbf{X}}(a_i) = \langle L_n^{\mathbf{X}}, \mathbf{a} \rangle$, where $\mathbf{a} = (a_1, \ldots, a_{|\Sigma|})$. Therefore,

$$S_n \in A \iff L_n^{\mathbf{X}} \in \{\nu : \langle \nu, \mathbf{a} \rangle \in A\} = \Gamma \tag{2.1.19}$$

for any $A \subset K$ and any integer $n$. Thus, the following version of Cramér’s theorem is a direct
consequence of Theorem 2.1.1.

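Theorem 2.1.2 (Cramér’s theorem for finite subsets of $\mathbb{R}^1$)

(a). For any set $A \subset K$

$$-\inf_{x \in A^o} I(x) \le \liminf_{n \to \infty} \frac{1}{n} \log \mathrm{Prob}_\mu(S_n \in A) \le \limsup_{n \to \infty} \frac{1}{n} \log \mathrm{Prob}_\mu(S_n \in A) \le -\inf_{x \in A} I(x) , \tag{2.1.20}$$

where $A^o$ is the interior of $A$ (as a set in $\mathbb{R}^1$) and $I(x) = \inf_{\{\nu : \langle \nu, \mathbf{a} \rangle = x\}} H(\nu|\mu)$.

(b). The continuous rate function $I(x)$ (for $x \in K$) also satisfies

$$I(x) = \sup_{\lambda \in \mathbb{R}^1} \{\lambda x - \Lambda(\lambda)\} , \tag{2.1.21}$$

where

$$\Lambda(\lambda) = \log \sum_{i=1}^{|\Sigma|} \mu(a_i) e^{\lambda a_i} .$$

Remark: Since the rate function is continuous (on $K$) it follows from (2.1.20) that whenever
$A \subset \overline{A^o} \subset K$

$$\lim_{n \to \infty} \frac{1}{n} \log \mathrm{Prob}_\mu(S_n \in A) = -\inf_{x \in A} I(x) .$$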

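Proof:

(a). The bounds of (2.1.20) are simply the bounds of (2.1.12) for the set $\Gamma$ defined in (2.1.19) (note
that $\{\nu : \langle \nu, \mathbf{a} \rangle \in A^o\} \subset \Gamma^o$).

(b). Observe that by Jensen’s inequality

$$\Lambda(\lambda) = \log \sum_{i=1}^{|\Sigma|} \mu(a_i) e^{\lambda a_i} \ge \sum_{i=1}^{|\Sigma|} \nu(a_i) \log \frac{\mu(a_i) e^{\lambda a_i}}{\nu(a_i)} = \lambda \langle \nu, \mathbf{a} \rangle - H(\nu|\mu) , \tag{2.1.22}$$

for any $\nu \in M_1(\Sigma)$ and any $\lambda \in \mathbb{R}^1$, with equality for $\nu_\lambda(a_i) = \mu(a_i) e^{\lambda a_i - \Lambda(\lambda)}$.

The function $\Lambda(\lambda)$ is differentiable and strictly convex, implying that $\Lambda'(\lambda)$ is strictly increasing.
Moreover, $K = [\inf_\lambda \Lambda'(\lambda), \sup_\lambda \Lambda'(\lambda)]$. Therefore, for any $x \in K^o$ there exists a unique
$\lambda^{(x)} \in \mathbb{R}^1$ such that $x - \Lambda'(\lambda^{(x)}) = 0$. As $\langle \nu_{\lambda^{(x)}}, \mathbf{a} \rangle = \Lambda'(\lambda^{(x)}) = x$, the application of
(2.1.22) for $\lambda^{(x)}$ yields

$$I(x) = \inf_{\{\nu : \langle \nu, \mathbf{a} \rangle = x\}} H(\nu|\mu) = H(\nu_{\lambda^{(x)}}|\mu) = \lambda^{(x)} x - \Lambda(\lambda^{(x)}) = \sup_{\lambda \in \mathbb{R}^1} \{\lambda x - \Lambda(\lambda)\} , \tag{2.1.23}$$

i.e., (2.1.21) holds for any $x \in K^o$. The continuity of $I(x)$ (for $x \in K$), which is a direct consequence
of the continuity of $H(\cdot|\mu)$, implies that (2.1.21) holds also at the boundaries of $K$. □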
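
The two descriptions of the rate function in Theorem 2.1.2 can be compared numerically. The Python sketch below solves $\Lambda'(\lambda) = x$ by bisection, which is justified because $\Lambda'$ is strictly increasing, and then evaluates both $\lambda x - \Lambda(\lambda)$ and the relative entropy of the tilted law $\nu_\lambda$. The alphabet `a`, the law `mu`, the bracketing interval and the function names are assumptions made for the illustration.

```python
from math import exp, log

a = (0.0, 1.0, 3.0)            # finite alphabet in R^1 (assumption)
mu = (0.5, 0.3, 0.2)           # underlying law (assumption); its mean is 0.9

def log_mgf(lam):
    """Lambda(lam) = log sum_i mu(a_i) exp(lam * a_i)."""
    return log(sum(m * exp(lam * ai) for m, ai in zip(mu, a)))

def tilted(lam):
    """The law nu_lam(a_i) = mu(a_i) exp(lam * a_i - Lambda(lam))."""
    w = [m * exp(lam * ai) for m, ai in zip(mu, a)]
    z = sum(w)
    return [wi / z for wi in w]

def mean(nu):
    return sum(p * ai for p, ai in zip(nu, a))

def solve_lambda(x, lo=-50.0, hi=50.0, iters=200):
    """Bisection for Lambda'(lam) = x; Lambda'(lam) is the mean of the tilted law."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean(tilted(mid)) < x:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def rate_legendre(x):
    lam = solve_lambda(x)
    return lam * x - log_mgf(lam)

def rate_entropy(x):
    nu = tilted(solve_lambda(x))
    return sum(p * log(p / m) for p, m in zip(nu, mu) if p > 0)

for x in (0.5, 0.9, 2.0):
    print(x, rate_legendre(x), rate_entropy(x))
```

The two columns agree for each $x$ in the interior of $K$, and both vanish at the mean of `mu`, where the maximizing $\lambda$ is zero.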

It is interesting to note that by deriving Cramér’s theorem from Sanov’s theorem, we have
followed a pattern which will be useful in the sequel and which is referred to as the ‘contraction
principle’ (cf. Section ??). Indeed, based on a large deviations result in the “big” space ($M_1(\Sigma)$), we
obtained a large deviations result on a “smaller” space which is obtained from the original space by
a continuous map (“contraction”). In the particular case at hand, however, an alternative proof of
the finite alphabet Sanov’s theorem follows directly from Cramér’s theorem in $\mathbb{R}^{|\Sigma|}$ and is presented
in Section 2.3.

Exercises: 

2.1.6 Construct an example for which the limit of $\frac{1}{n} \log \mathrm{Prob}_\mu(S_n = x)$ as $n \to \infty$ does not exist.
Hint: Note that for $|\Sigma| = 2$ the empirical mean ($S_n$) uniquely determines the empirical measure
($L_n^{\mathbf{X}}$). Rely on this observation and exercise 2.1.4 to construct the desired example.

2.1.7 (a). Prove that $I(x) = 0$ if and only if $x = \bar{x} = E_\mu(X_1)$. Explain why this should have been
anticipated in view of the weak law of large numbers (which states that $S_n \to \bar{x}$ in probability, as
$n \to \infty$).

(b). Check that $H(\nu|\mu) = 0$ if and only if $\nu = \mu$ and interpret this result.

2.1.8 Guess the value of $\lim_{n \to \infty} \mathrm{Prob}_\mu(X_1 = a_i | S_n \ge q)$ for $\bar{x} < q \in K^o$ and try to justify your
guess (at least heuristically).

2.1.9 Extend Theorem 2.1.2 to $\Sigma$ which is a finite subset of $\mathbb{R}^d$. In particular, determine the
shape of the set $K$ and find the appropriate extension of the formula (2.1.21).

2.1.3 Large deviations for sampling without replacement

The method of types is useful also for studying the large deviations of the empirical measure of a
sequence of dependent random variables. For example, consider the following setup of sampling
without replacement, which is often encountered in statistical problems. Out of an initial
deterministic pool of $M$ items $\mathbf{x} = (x_1, \ldots, x_M)$, an $n$-tuple $\mathbf{X} = (x_{i_1}, x_{i_2}, \ldots, x_{i_n})$ is sampled
without replacement, namely, $i_1 \ne i_2 \ne \cdots \ne i_n$ and all choices of distinct indices
$i_1, i_2, \ldots, i_n \in \{1, \ldots, M\}$ are equally likely.

Suppose now that $x_1^{(M)}, \ldots, x_M^{(M)}$ are elements of the same finite set $\Sigma = \{a_1, \ldots, a_{|\Sigma|}\}$ and that
as $M \to \infty$ the deterministic relative frequency vectors $L_M^{\mathbf{x}} = (L_M^{\mathbf{x}}(a_1), \ldots, L_M^{\mathbf{x}}(a_{|\Sigma|}))$ converge
(pointwise) to $\mu \in M_1(\Sigma)$. Recall that

$$L_M^{\mathbf{x}}(a_i) = \frac{1}{M} \sum_{j=1}^{M} 1_{a_i}(x_j^{(M)}) = \frac{1}{M} N(a_i|\mathbf{x}) , \quad i = 1, \ldots, |\Sigma| . \tag{2.1.24}$$


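Suppose further that $\mathbf X$ is a random vector obtained by the sampling without replacement of $n$
out of $M$ elements as described above. We investigate the large deviations principle governing the
random empirical measure $L_n^{\mathbf X}$ associated with the vector $\mathbf X$. Note that $L_n^{\mathbf X}$ belongs to the set $\mathcal{L}_n$
whose size grows polynomially in $n$ (see Lemma 2.1.1). In particular we aim at deriving the analog
of Theorem 2.1.1 when $\lim_{M\to\infty} \frac{n}{M} = \beta \in (0, 1)$. For that purpose define the following candidate
rate function

$$I_{\beta,\mu}(\nu) = \begin{cases} H(\nu|\mu) + \dfrac{1-\beta}{\beta}\, H\!\left(\dfrac{\mu - \beta\nu}{1-\beta}\,\Big|\,\mu\right) & \text{if } \mu_i \ge \beta\nu_i \text{ for all } i \\[2mm] \infty & \text{otherwise} \end{cases} \qquad (2.1.25)$$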

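By basic combinatorics and an application of Lemma 2.1.3 we obtain the following estimates of
large deviations probabilities for $L_n^{\mathbf X}$.

Lemma 2.1.5 For any probability vector $\nu \in \mathcal{L}_n$

(a). If $I_{\frac{n}{M}, L_M^{\mathbf x}}(\nu) < \infty$ then

$$\left| \frac{1}{n}\log \mathrm{Prob}(L_n^{\mathbf X} = \nu) + I_{\frac{n}{M}, L_M^{\mathbf x}}(\nu) \right| \le 2(|\Sigma|+1)\,\frac{\log(M+1)}{n} . \qquad (2.1.26)$$

(b). If $I_{\frac{n}{M}, L_M^{\mathbf x}}(\nu) = \infty$ then

$$\mathrm{Prob}(L_n^{\mathbf X} = \nu) = 0 . \qquad (2.1.27)$$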


Proof: (a). In the sampling without replacement procedure the probability of the event $\{L_n^{\mathbf X} = \nu\}$
for $\nu \in \mathcal{L}_n$ is exactly the number of $n$-tuples $i_1 \ne i_2 \ne \cdots \ne i_n$ resulting with type $\nu$ divided by the
overall number of $n$-tuples, that is

$$\mathrm{Prob}(L_n^{\mathbf X} = \nu) = \frac{\prod_{i=1}^{|\Sigma|} \binom{M L_M^{\mathbf x}(a_i)}{n\nu(a_i)}}{\binom{M}{n}} . \qquad (2.1.28)$$

An application of Lemma 2.1.3 for $|\Sigma| = 2$ (where $|T_n(\nu)| = \binom{n}{k}$ when $\nu(a_1) = \frac{k}{n}$, $\nu(a_2) = 1 - \frac{k}{n}$),
results with the following estimate

$$\max_{0 \le k \le n} \left| \log \binom{n}{k} - n H\!\left(\frac{k}{n}\right) \right| \le 2\log(n+1) , \qquad (2.1.29)$$

where

$$H(p) = -p\log p - (1-p)\log(1-p) .$$

Alternatively, (2.1.29) follows by Stirling's formula (see [12], pg 48).
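The estimate (2.1.29) is easy to probe numerically. The following is a minimal sketch (not taken from the notes; the helper names `binary_entropy` and `max_entropy_gap` are hypothetical) that computes the largest gap between $\log\binom{n}{k}$ and $nH(k/n)$ over $0 \le k \le n$, so that it can be compared with $2\log(n+1)$.

```python
import math


def binary_entropy(p):
    """H(p) = -p log p - (1-p) log(1-p), with the convention 0 log 0 = 0."""
    if p == 0.0 or p == 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def max_entropy_gap(n):
    """Largest value of |log C(n,k) - n H(k/n)| over k = 0, ..., n."""
    return max(
        abs(math.log(math.comb(n, k)) - n * binary_entropy(k / n))
        for k in range(n + 1)
    )
```

Under these assumptions one would expect `max_entropy_gap(n)` to stay below `2 * math.log(n + 1)` for every `n`, which is exactly the content of (2.1.29).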


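Combining the exact expression (2.1.28) and the bound of (2.1.29) results with

$$\left| \frac{1}{n}\log \mathrm{Prob}(L_n^{\mathbf X} = \nu) - \sum_{i=1}^{|\Sigma|} \frac{M L_M^{\mathbf x}(a_i)}{n}\, H\!\left(\frac{n\nu(a_i)}{M L_M^{\mathbf x}(a_i)}\right) + \frac{M}{n}\, H\!\left(\frac{n}{M}\right) \right| \le 2(|\Sigma|+1)\,\frac{\log(M+1)}{n} . \qquad (2.1.30)$$

The inequality (2.1.26) follows when rearranging the left side of (2.1.30).

(b). Note that $I_{\frac{n}{M}, L_M^{\mathbf x}}(\nu) = \infty$ if and only if $n\nu(a_i) > M L_M^{\mathbf x}(a_i) = M(a_i|\mathbf{x})$ for some $a_i \in \Sigma$. It
is then impossible in sampling without replacement to have $L_n^{\mathbf X} = \nu$, as $n L_n^{\mathbf X}(a_i) = N(a_i|\mathbf{X}) \le
M(a_i|\mathbf{x})$ for any $a_i \in \Sigma$. □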
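As an illustration of Lemma 2.1.5, the sketch below compares the exact probability (2.1.28) with the candidate rate function. It is only a sketch under stated assumptions: the rate function is coded in the form $H(\nu|\mu) + \frac{1-\beta}{\beta} H\big(\frac{\mu-\beta\nu}{1-\beta}\big|\mu\big)$, taken to be infinite when $\mu_i < \beta\nu_i$ for some $i$, and the function names and the example pool are hypothetical.

```python
import math


def rel_entropy(nu, mu):
    """H(nu|mu) = sum_i nu_i log(nu_i / mu_i), with 0 log 0 = 0."""
    total = 0.0
    for p, q in zip(nu, mu):
        if p > 0.0:
            if q == 0.0:
                return math.inf
            total += p * math.log(p / q)
    return total


def rate(nu, beta, mu, tol=1e-12):
    """Candidate rate function I_{beta,mu}(nu) (assumed form, see lead-in)."""
    if any(m < beta * v - tol for m, v in zip(mu, nu)):
        return math.inf
    # Leftover composition of the pool after removing the sample.
    theta = [max((m - beta * v) / (1.0 - beta), 0.0) for m, v in zip(mu, nu)]
    return rel_entropy(nu, mu) + (1.0 - beta) / beta * rel_entropy(theta, mu)


def log_prob_type(pool_counts, sample_counts):
    """log Prob(L_n^X = nu) via the product of binomials in (2.1.28)."""
    if any(k > m for k, m in zip(sample_counts, pool_counts)):
        return -math.inf  # case (b) of the lemma: impossible type
    big_m, n = sum(pool_counts), sum(sample_counts)
    numerator = sum(
        math.log(math.comb(m, k)) for m, k in zip(pool_counts, sample_counts)
    )
    return numerator - math.log(math.comb(big_m, n))
```

For a pool with counts `(100, 60, 40)` and a sample with counts `(20, 20, 10)`, the quantity `log_prob_type(...) / 50 + rate([0.4, 0.4, 0.2], 0.25, [0.5, 0.3, 0.2])` should be bounded in absolute value by `2 * (3 + 1) * math.log(201) / 50`, in line with (2.1.26).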


Following the proof of Theorem 2.1.1 the above point estimates result with the analogs of
(2.1.15) and (2.1.16), namely:

Corollary 2.1.1

$$\limsup_{M\to\infty} \frac{1}{n}\log \mathrm{Prob}(L_n^{\mathbf X} \in \Gamma) = -\liminf_{M\to\infty} \Big\{ \inf_{\nu \in \Gamma \cap \mathcal{L}_n} I_{\frac{n}{M}, L_M^{\mathbf x}}(\nu) \Big\} \qquad (2.1.31)$$

and

$$\liminf_{M\to\infty} \frac{1}{n}\log \mathrm{Prob}(L_n^{\mathbf X} \in \Gamma) = -\limsup_{M\to\infty} \Big\{ \inf_{\nu \in \Gamma \cap \mathcal{L}_n} I_{\frac{n}{M}, L_M^{\mathbf x}}(\nu) \Big\} . \qquad (2.1.32)$$

The proof of Corollary 2.1.1 is left as exercise 2.1.10.

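The following theorem is the desired analog of Theorem 2.1.1.

Theorem 2.1.3 The good rate function $I_{\beta,\mu}(\cdot)$ controls the large deviations principle related to the
random empirical measures $L_n^{\mathbf X}$ (as elements of $\mathbb{R}^{|\Sigma|}$). Specifically, for any set $\Gamma$ of probability
vectors in $M_1(\Sigma) \subset \mathbb{R}^{|\Sigma|}$

$$-\inf_{\nu \in \Gamma^o} I_{\beta,\mu}(\nu) \le \liminf_{M\to\infty} \frac{1}{n}\log \mathrm{Prob}(L_n^{\mathbf X} \in \Gamma) \le \limsup_{M\to\infty} \frac{1}{n}\log \mathrm{Prob}(L_n^{\mathbf X} \in \Gamma) \le -\inf_{\nu \in \bar\Gamma} I_{\beta,\mu}(\nu) , \qquad (2.1.33)$$

where $\beta = \lim_{M\to\infty} \frac{n}{M} \in (0, 1)$ and $L_M^{\mathbf x}$ converges to $\mu$ as $M \to \infty$.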


Remark:  Note  that  the  upper  bound  of  (2.1.33)  is  weaker  than  the  upper  bound  of  (2.1.12).  Also, 
see  exercise  2.1.11  for  examples  of  sets  for  which  the  above  lower  and  upper  bounds  coincide. 

Proof: As already noted, $H(\cdot|\mu)$ is a rate function on $M_1(\Sigma)$, for any fixed $\mu$ (namely, it is lower
semicontinuous and non-negative). Moreover, the set $\{\nu : (\mu - \beta\nu)/(1 - \beta) \in M_1(\Sigma)\}$ is a closed
subset of $M_1(\Sigma)$ and therefore $I_{\beta,\mu}(\cdot)$ is also a rate function for any fixed $\beta \in (0, 1)$ and any fixed
$\mu \in M_1(\Sigma)$. Since $|\Sigma| < \infty$, the probability simplex $M_1(\Sigma)$ is a compact set and thus any rate
function on $M_1(\Sigma)$ is a good rate function.


The first step in deriving the bounds of (2.1.33) is to prove that $I_{\beta,\mu}(\nu)$ is a lower semicontinuous
function over $(0,1) \times M_1(\Sigma) \times M_1(\Sigma)$ (jointly in $\beta$, $\mu$ and $\nu$) and is strictly continuous along
sequences $\{\beta_n, \mu_n, \nu_n\}$ where $I_{\beta_n,\mu_n}(\nu_n) < \infty$. For that purpose, fix a sequence $\{\beta_n, \mu_n, \nu_n\}$ such
that $\beta_n \to \beta \in (0, 1)$, $\nu_n \to \nu$ and $\mu_n \to \mu$. The lower semicontinuity is trivial along any subsequence
for which $I_{\beta_n,\mu_n}(\nu_n) = \infty$. Thus, without loss of generality assume that $I_{\beta_n,\mu_n}(\nu_n) < \infty$
for all $n$ large enough, namely that $\mu_n(a_i) \ge \beta_n\nu_n(a_i)$ for any $a_i \in \Sigma$. Then, by rearranging (2.1.25)

$$I_{\beta_n,\mu_n}(\nu_n) = \frac{1}{\beta_n} H(\mu_n) - H(\nu_n) - \frac{1-\beta_n}{\beta_n}\, H\!\left(\frac{\mu_n - \beta_n\nu_n}{1-\beta_n}\right) .$$

As $\beta_n$ are bounded away from 0 and 1 for large $n$, it follows from the above expression that
$I_{\beta_n,\mu_n}(\nu_n) \to I_{\beta,\mu}(\nu)$ since the entropy function $H(\cdot)$ is strictly continuous over $M_1(\Sigma)$ (for $|\Sigma| <
\infty$).

We turn now to prove the upper bound of (2.1.33). We first deduce from (2.1.31) that for some
infinite subsequence $M_k$, there exist $\nu_k \in \Gamma$ such that

$$\limsup_{M\to\infty} \frac{1}{n}\log \mathrm{Prob}(L_n^{\mathbf X} \in \Gamma) = -\lim_{k\to\infty} I_{\frac{n_k}{M_k}, L_{M_k}^{\mathbf x}}(\nu_k) \triangleq -I^* . \qquad (2.1.34)$$

Moreover, the sequence $\{\nu_k\}$ has a limit point $\nu^*$ in the compact set $\bar\Gamma$. Passing to a convergent
subsequence, the lower semicontinuity of $I$ jointly in $\beta$, $\mu$ and $\nu$ implies that

$$-I^* \le -I_{\beta,\mu}(\nu^*) \le -\inf_{\nu \in \bar\Gamma} I_{\beta,\mu}(\nu) . \qquad (2.1.35)$$

The proof of the upper bound is complete in view of (2.1.34) and (2.1.35).

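In view of (2.1.32) it suffices for the lower bound of (2.1.33) to prove that

$$-\limsup_{M\to\infty} \Big\{ \inf_{\nu' \in \Gamma \cap \mathcal{L}_n} I_{\frac{n}{M}, L_M^{\mathbf x}}(\nu') \Big\} \ge -I_{\beta,\mu}(\nu) \qquad (2.1.36)$$

for any $\nu \in \Gamma^o$ where $I_{\beta,\mu}(\nu) < \infty$ (see (1.1.7) in Section 1.1 for this formulation of the large
deviations lower bound). Fix such a point $\nu$, i.e., such that

$$\mu(a_i) \ge \beta\nu(a_i) \quad \forall a_i \in \Sigma . \qquad (2.1.37)$$

In particular, $\Sigma_\nu \subset \Sigma_\mu$ and as $\nu \in \Gamma^o$ as well, $\nu$ belongs to the relative interior of $\Gamma$ in $M_1(\Sigma_\mu)$.
Further, (2.1.36) follows from

$$-\limsup_{M\to\infty} \Big\{ \inf_{\nu' \in \Gamma \cap M_1(\Sigma_\mu) \cap \mathcal{L}_n} I_{\frac{n}{M}, L_M^{\mathbf x}}(\nu') \Big\} \ge -I_{\beta,\mu}(\nu) .$$

Therefore, one may assume without loss of generality that $\Sigma = \Sigma_\mu$. Now, since $\beta < 1$ there
always exists a sequence of probability vectors $\nu_k \in \Gamma^o$ which converges to $\nu$ and for which all the
inequalities in (2.1.37) are strict. As $I_{\beta,\mu}(\cdot)$ is continuous along any such sequence, we deduce that
it suffices to prove (2.1.36) for $\nu \in \Gamma^o$ for which

$$\min_{a_i \in \Sigma} \{\mu(a_i) - \beta\nu(a_i)\} > 0 . \qquad (2.1.38)$$

By the same argument as in the proof of Theorem 2.1.1 there exists a sequence $\nu_n \in \Gamma \cap \mathcal{L}_n$ such
that $\nu_n \to \nu$ as $n \to \infty$. Now, since $L_M^{\mathbf x} \to \mu$ and $\frac{n}{M} \to \beta$ the strict inequality (2.1.38) implies that
for all $M$ sufficiently large

$$\min_{a_i \in \Sigma} \Big\{ L_M^{\mathbf x}(a_i) - \frac{n}{M}\nu_n(a_i) \Big\} > 0$$

and then $I_{\frac{n}{M}, L_M^{\mathbf x}}(\nu_n) < \infty$ (note also that $\frac{n}{M}$ is eventually bounded away from 0 and 1). Therefore,
by the strict continuity of $I$ along such sequences

$$I_{\beta,\mu}(\nu) = \lim_{M\to\infty} I_{\frac{n}{M}, L_M^{\mathbf x}}(\nu_n) .$$

Since $\nu_n \in \Gamma \cap \mathcal{L}_n$,

$$\limsup_{M\to\infty} \Big\{ \inf_{\nu' \in \Gamma \cap \mathcal{L}_n} I_{\frac{n}{M}, L_M^{\mathbf x}}(\nu') \Big\} \le \lim_{M\to\infty} I_{\frac{n}{M}, L_M^{\mathbf x}}(\nu_n)$$

and the inequality (2.1.36) now follows. □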

Exercises:

2.1.10 Prove Corollary 2.1.1.

2.1.11 Let $I_\Gamma = \inf_{\nu \in \Gamma^o} I_{\beta,\mu}(\nu)$. Prove that when $I_\Gamma < \infty$ and $\Gamma$ is included in the closure of its
interior, then

$$-I_\Gamma = \lim_{M\to\infty} \frac{1}{n}\log \mathrm{Prob}(L_n^{\mathbf X} \in \Gamma) .$$

Hint: See exercise 2.1.2 and use the continuity of $I_{\beta,\mu}(\cdot)$ within its level sets.

2.1.12 Prove that the rate function $I_{\beta,\mu}(\cdot)$ is a convex function.

2.1.13 Prove that the large deviations principle of Theorem 2.1.3 holds for $\beta = 0$ with the good
rate function $I_{0,\mu}(\cdot) = H(\cdot|\mu)$ provided that $n \to \infty$.

Hint: First prove that the left side of (2.1.30) goes to zero as long as $n \to \infty$ (i.e., even when
$\frac{n}{M} \to 0$). Then prove the lower semicontinuity of $I_{\beta,\mu}(\nu)$ at $\beta = 0$ and use it to derive
the upper bound. Finally, it suffices to prove the lower bound when $\Sigma_\nu \subset \Sigma_\mu$ and so the sequence
$\nu_n \in M_1(\Sigma_\mu) \cap \Gamma \cap \mathcal{L}_n$ will converge to $\nu$ while eventually $I_{\frac{n}{M}, L_M^{\mathbf x}}(\nu_n) < \infty$.


2.2 Cramer’s Theorem in $\mathbb{R}^1$

Cramer’s theorem about the large deviations associated with the empirical means of i.i.d. random
variables is presented in Section 2.1.2 as an application of the method of types. However, the
method of types is limited in its scope to finite alphabets. Moreover, it neither explains why a
large deviations principle is satisfied nor predicts which rate functions one should expect in similar
situations. In this section, we start pursuing a different route aiming at proving the analog of
Theorem 2.1.2 when the underlying alphabet $\Sigma$ is $\mathbb{R}^1$. The approach outlined here is more amenable
to generalization and better illustrates the main ingredients involved in proving a typical large
deviations principle.

Consider the empirical means $\hat S_n = \frac{1}{n}\sum_{j=1}^{n} X_j$, where the real valued random variables
$X_1, \ldots, X_n, \ldots$ are independent and identically distributed according to the marginal probability
law $\mu \in M_1(\mathbb{R})$. Let $\mu_n$ denote the law of $\hat S_n$ and let $\bar x = E_\mu[X_1]$ denote the true underlying mean.
If we further assume that $|\bar x| < \infty$ and $E[|X_1 - \bar x|^2] < \infty$ then $\hat S_n \to \bar x$ in probability as $n \to \infty$, since

$$E[(\hat S_n - \bar x)^2] = \frac{1}{n^2}\sum_{j=1}^{n} E[|X_j - \bar x|^2] \xrightarrow[n\to\infty]{} 0 . \qquad (2.2.39)$$

Actually, the finite variance condition is not needed for the convergence in probability but we will
not care about that here. Because of (2.2.39), $\mu_n(A) \to 0$ as $n \to \infty$ for any set $A$ such that $\bar x \notin \bar A$.

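Cramer’s theorem characterizes the logarithmic rate of this convergence by the (rate) function

$$\Lambda^*(x) = \sup_{\lambda \in \mathbb{R}^1} [\lambda x - \Lambda(\lambda)] , \qquad (2.2.40)$$

where

$$\Lambda(\lambda) = \log M(\lambda) = \log E[e^{\lambda X_1}] , \qquad (2.2.41)$$

is also called the logarithmic moment generating function. Note that while $\Lambda(\lambda) > -\infty$ it is
possible to have $\Lambda(\lambda) = \infty$. Let $\mathcal{D}_\Lambda = \{\lambda : \Lambda(\lambda) < \infty\}$ and $\mathcal{D}_{\Lambda^*} = \{x : \Lambda^*(x) < \infty\}$.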

The rate function $\Lambda^*(x)$ of (2.2.40) is called the Legendre transform of $\Lambda(\lambda)$. Some of the
properties of this function (and of $\Lambda(\lambda)$) which are useful when proving Cramer’s theorem are
summarized in the following lemma whose proof is deferred to the end of this section. The exact
definition of Legendre transform and its properties for more general vector spaces are presented in
Sections 2.3 (for $\mathbb{R}^d$) and ?? (for a general class of metric spaces).
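To make the definition (2.2.40) concrete, here is a small numerical sketch (not from the notes) that evaluates the supremum for a Bernoulli($p$) variable, where $\Lambda(\lambda) = \log(1 - p + pe^{\lambda})$, and compares it with the closed form $x\log\frac{x}{p} + (1-x)\log\frac{1-x}{1-p}$. The function names, the search interval $[-50, 50]$ and the use of ternary search (valid because $\lambda \mapsto \lambda x - \Lambda(\lambda)$ is concave) are choices made for this illustration only.

```python
import math


def log_mgf_bernoulli(lam, p):
    """Lambda(lam) = log E[exp(lam X_1)] for X_1 ~ Bernoulli(p)."""
    return math.log(1.0 - p + p * math.exp(lam))


def legendre(x, log_mgf, lo=-50.0, hi=50.0, iters=200):
    """Approximate sup over [lo, hi] of lam*x - log_mgf(lam) by ternary search.

    The objective is concave in lam, so ternary search locates its maximum
    provided the maximizer lies inside the search interval.
    """
    def g(lam):
        return lam * x - log_mgf(lam)

    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if g(m1) < g(m2):
            lo = m1
        else:
            hi = m2
    return g((lo + hi) / 2.0)


def rate_bernoulli(x, p):
    """Closed form of the Legendre transform for Bernoulli(p), 0 < x < 1."""
    return x * math.log(x / p) + (1.0 - x) * math.log((1.0 - x) / (1.0 - p))
```

For instance, `legendre(0.7, lambda lam: log_mgf_bernoulli(lam, 0.3))` should agree with `rate_bernoulli(0.7, 0.3)` up to numerical error, and the value at `x = 0.3` (the mean) should be essentially zero.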

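Lemma 2.2.1

(a). $\Lambda$ is a convex function and $\Lambda^*$ is a convex rate function.

(b). $\Lambda^*(\bar x) = 0$ and for $x \in [\bar x, \infty)$,

$$\Lambda^*(x) = \sup_{\lambda \ge 0} [\lambda x - \Lambda(\lambda)] , \qquad (2.2.42)$$

is nondecreasing. Similarly, for $x \in (-\infty, \bar x]$,

$$\Lambda^*(x) = \sup_{\lambda \le 0} [\lambda x - \Lambda(\lambda)] \qquad (2.2.43)$$

and is nonincreasing.

(c). For any $\eta \in \mathcal{D}_\Lambda^o$

$$\Lambda'(\eta) = \frac{1}{M(\eta)}\, E[X_1 e^{\eta X_1}] \qquad (2.2.44)$$

and

$$\Lambda'(\eta) = y \;\Longrightarrow\; \Lambda^*(y) = \eta y - \Lambda(\eta) . \qquad (2.2.45)$$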

The following definition is needed for the precise statement of Cramer’s theorem.

Definition 2.2.1 The logarithmic moment generating function $\Lambda(\cdot)$ is steep if $\liminf_{n\to\infty} |\Lambda'(\lambda_n)| =
\infty$ for any sequence $\lambda_n \in \mathcal{D}_\Lambda^o$ which converges to $\lambda \in \bar{\mathcal{D}}_\Lambda \setminus \mathcal{D}_\Lambda^o$.

A weak version of Cramer’s theorem is proved below, establishing the large deviations principle in
$\mathbb{R}^1$ when $\Lambda(\cdot)$ is a steep function and $0 \in \mathcal{D}_\Lambda^o$. For example, both of these conditions hold when
$\mathcal{D}_\Lambda = \mathbb{R}^1$. It is possible to establish the large deviations principle in $\mathbb{R}^1$ with no condition on $\Lambda(\lambda)$
by adopting a more sophisticated proof based on sub-additivity and convexity arguments. This is
pursued in much generality in Sections ?? and ?? where the exact consequences for the special
case discussed here are given in exercise ??.


Theorem 2.2.1 (Cramer)

(a). The large deviations upper bound

$$\limsup_{n\to\infty} \frac{1}{n}\log \mu_n(A) \le -\inf_{x \in \bar A} \Lambda^*(x) , \qquad (2.2.46)$$

holds for any measurable $A \subset \mathbb{R}^1$.

(b). For any open set $G \subset \mathbb{R}^1$

$$\liminf_{n\to\infty} \frac{1}{n}\log \mu_n(G) \ge -\inf_{x \in G \cap \bar{\mathcal{F}}} \Lambda^*(x) , \qquad (2.2.47)$$

where $\mathcal{F} = \{x : x = \Lambda'(\lambda) \text{ for some } \lambda \in \mathcal{D}_\Lambda^o\}$.

(c). If $\mathcal{D}_{\Lambda^*} \subset \bar{\mathcal{F}}$ then the family of probability measures $\mu_n$ satisfies the large deviations principle
with the rate function $\Lambda^*(x)$. In particular, $\mu_n$ satisfies the large deviations principle if $\mathcal{D}_\Lambda = \mathbb{R}^1$.

(d). If $0 \in \mathcal{D}_\Lambda^o$ and $\Lambda$ is a steep function then the family of probability measures $\mu_n$ satisfies the
large deviations principle with the good rate function $\Lambda^*(x)$.

Remark: The Legendre transform $\Lambda^*(x)$ is a natural candidate for the rate function under a very
general situation where the upper bound (2.2.46) holds (for more on this issue, see Sections 2.3
and ??). For that reason, convexity plays an important role in large deviations as seen throughout
Chapters ??-??.


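Proof:

(a). If $\bar x \in \bar A$ then the upper bound (2.2.46) trivially holds (as $\Lambda^*(\bar x) = 0$). Otherwise, let $(x_-, x_+)$
be the largest interval around $\bar x$ such that $(x_-, x_+) \cap A = \emptyset$. Since

$$A \subset (-\infty, x_-] \cup [x_+, \infty) \triangleq A_- \cup A_+ ,$$

it suffices to prove that

$$\mu_n(A_+) \le e^{-n\Lambda^*(x_+)} \qquad (2.2.48)$$

and

$$\mu_n(A_-) \le e^{-n\Lambda^*(x_-)} \qquad (2.2.49)$$

in order to prove the upper bound. Indeed, then

$$\mu_n(A) \le \mu_n(A_+) + \mu_n(A_-) \le e^{-n\Lambda^*(x_+)} + e^{-n\Lambda^*(x_-)} \le 2e^{-n\min\{\Lambda^*(x_-), \Lambda^*(x_+)\}} .$$

By part (b) of Lemma 2.2.1

$$\min\{\Lambda^*(x_-), \Lambda^*(x_+)\} = \inf_{x \in \bar A} \Lambda^*(x)$$

and the proof of the upper bound follows.

Returning now to prove (2.2.48) and (2.2.49), these are consequences of a special form of
Markov’s inequality. Clearly, for any $\lambda \ge 0$,

$$E[e^{n\lambda(\hat S_n - x_+)}] = e^{-n\lambda x_+} \prod_{i=1}^{n} E[e^{\lambda X_i}] = e^{-n[\lambda x_+ - \Lambda(\lambda)]} . \qquad (2.2.50)$$

Therefore, since $1_{\{\hat S_n \ge x_+\}} \le e^{n\lambda(\hat S_n - x_+)}$ for any $\lambda \ge 0$,

$$\mu_n(A_+) \le \inf_{\lambda \ge 0} e^{-n[\lambda x_+ - \Lambda(\lambda)]} = e^{-n \sup_{\lambda \ge 0}[\lambda x_+ - \Lambda(\lambda)]}$$

and (2.2.48) follows by (2.2.42) as $x_+ \in [\bar x, \infty)$. The proof of (2.2.49) is similar.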
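The bound (2.2.48) can be checked exactly in a simple case. The sketch below is an illustration only: it takes Bernoulli($p$) variables, for which the rate function has the closed form stated in exercise 2.2.6(b), and the helper names `rate_bernoulli` and `upper_tail` are hypothetical. It computes $\mathrm{Prob}(\hat S_n \ge x)$ from the binomial distribution so that it can be compared with $e^{-n\Lambda^*(x)}$.

```python
import math


def rate_bernoulli(x, p):
    """Lambda*(x) for Bernoulli(p) and 0 < x < 1 (closed form)."""
    return x * math.log(x / p) + (1.0 - x) * math.log((1.0 - x) / (1.0 - p))


def upper_tail(n, p, x):
    """Exact Prob(S_n >= x) for the mean S_n of n i.i.d. Bernoulli(p) variables."""
    k0 = math.ceil(n * x - 1e-12)  # smallest count k with k/n >= x
    return sum(
        math.comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(k0, n + 1)
    )
```

With `p = 0.3` and `x = 0.5`, `upper_tail(n, 0.3, 0.5)` should never exceed `math.exp(-n * rate_bernoulli(0.5, 0.3))`, while `-math.log(upper_tail(n, 0.3, 0.5)) / n` approaches the rate as `n` grows, which is the logarithmic sharpness the lower bound part of the theorem asserts.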

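(b). The lower bound of (2.2.47) trivially holds when $G \cap \bar{\mathcal{F}}$ is an empty set. Moreover, since $G$ is
an open set it suffices to show that for any $x \in \bar{\mathcal{F}}$ and any open interval $B_{x,\delta}$ of center at $x$ (and
width $2\delta$)

$$\liminf_{n\to\infty} \frac{1}{n}\log \mu_n(B_{x,\delta}) \ge -\Lambda^*(x) . \qquad (2.2.51)$$

Consider first points $y \in \mathcal{F}$. Fix such a point $y = \Lambda'(\eta) \in \mathcal{F}$ and fix $\delta > 0$. Note that $\eta \in \mathcal{D}_\Lambda^o$ and
define the associated probability measure (see [12])

$$\tilde\mu(dx) = \frac{e^{\eta x}}{M(\eta)}\, \mu(dx) = e^{\eta x - \Lambda(\eta)} \mu(dx) .$$

Indeed $\tilde\mu$ is a probability measure as $\int \tilde\mu(dx) = \frac{1}{M(\eta)} \int e^{\eta x} \mu(dx) = 1$. Define accordingly $\tilde\mu_n$
to be the law governing $\hat S_n$ when $X_i$ are i.i.d. random variables of law $\tilde\mu$. Note that by (2.2.44)

$$\frac{1}{M(\eta)} \int x e^{\eta x} \mu(dx) = \Lambda'(\eta) = y .$$

Thus, the mean of $X_i$ under $\tilde\mu$ equals $y$ and by the weak law of large numbers,

$$\lim_{n\to\infty} \tilde\mu_n(B_{y,\delta}) = 1 . \qquad (2.2.52)$$

Moreover,

$$\mu_n(B_{y,\delta}) = \int_{|\sum_i x_i - ny| < n\delta} d\mu(x_1) \cdots d\mu(x_n) \ge e^{-n(\eta y + |\eta|\delta)} \int_{|\sum_i x_i - ny| < n\delta} e^{\eta \sum_i x_i}\, d\mu(x_1) \cdots d\mu(x_n)$$

$$= e^{-n(\eta y + |\eta|\delta)} e^{n\Lambda(\eta)}\, \tilde\mu_n(B_{y,\delta}) = \tilde\mu_n(B_{y,\delta})\, e^{-n\Lambda^*(y)} e^{-n|\eta|\delta} \qquad (2.2.53)$$

where the last equality follows by (2.2.45). Therefore, by combining (2.2.52) and (2.2.53)

$$\liminf_{n\to\infty} \frac{1}{n}\log \mu_n(B_{y,\delta}) \ge -\Lambda^*(y) - |\eta|\delta .$$

Take now $\delta_m \to 0$, $\delta_m < \delta$. Then, from the above,

$$\liminf_{n\to\infty} \frac{1}{n}\log \mu_n(B_{y,\delta}) \ge \liminf_{n\to\infty} \frac{1}{n}\log \mu_n(B_{y,\delta_m}) \ge -\Lambda^*(y) - |\eta|\delta_m \xrightarrow[m\to\infty]{} -\Lambda^*(y)$$

and the proof of (2.2.51) is complete for $y \in \mathcal{F}$.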
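The exponential change of measure used in this part of the proof can be illustrated on a finite distribution. The sketch below is not from the notes; the three-point law and the function names are hypothetical. It builds the tilted law $\tilde\mu(dx) = e^{\eta x - \Lambda(\eta)}\mu(dx)$ so that one can check that it is a probability law whose mean is $\Lambda'(\eta)$, as in (2.2.44), and that $\eta y - \Lambda(\eta)$ dominates $\lambda y - \Lambda(\lambda)$ for other $\lambda$, as in (2.2.45).

```python
import math


def log_mgf(support, probs, eta):
    """Lambda(eta) = log sum_x mu(x) exp(eta * x) for a finitely supported law."""
    return math.log(sum(q * math.exp(eta * x) for x, q in zip(support, probs)))


def tilted_law(support, probs, eta):
    """Weights of the tilted law: mu(x) * exp(eta * x - Lambda(eta))."""
    lam = log_mgf(support, probs, eta)
    return [q * math.exp(eta * x - lam) for x, q in zip(support, probs)]


def mean(support, probs):
    """Mean of a finitely supported law."""
    return sum(x * q for x, q in zip(support, probs))
```

For example, with `support = [0.0, 1.0, 3.0]`, `probs = [0.5, 0.3, 0.2]` and `eta = 0.4`, the mean of `tilted_law(support, probs, eta)` should match a finite-difference estimate of the derivative of `log_mgf` at `eta`.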
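The inequality (2.2.51) trivially holds when $x \notin \mathcal{D}_{\Lambda^*}$, while when $x \in \mathcal{D}_{\Lambda^*} \cap \bar{\mathcal{F}}$ there exist
$y_r \in \mathcal{F} \subseteq \mathcal{D}_{\Lambda^*}$ such that $y_r \to x$. Since $\Lambda^*(\cdot)$ is a convex rate function, $\Lambda^*(y_r) \to \Lambda^*(x)$ and as $B_{x,\delta}$
is an open set, eventually $B_{y_r,\delta_r} \subset B_{x,\delta}$ for some $\delta_r > 0$. Finally, apply (2.2.51) for $y_r \in \mathcal{F}$ to obtain

$$\liminf_{n\to\infty} \frac{1}{n}\log \mu_n(B_{x,\delta}) \ge \lim_{r\to\infty} \liminf_{n\to\infty} \frac{1}{n}\log \mu_n(B_{y_r,\delta_r}) \ge -\lim_{r\to\infty} \Lambda^*(y_r) = -\Lambda^*(x) .$$

(c+d). If $\mathcal{D}_{\Lambda^*} \subset \bar{\mathcal{F}}$ then for any $G \subset \mathbb{R}^1$

$$-\inf_{x \in G \cap \bar{\mathcal{F}}} \Lambda^*(x) \ge -\inf_{x \in G \cap \mathcal{D}_{\Lambda^*}} \Lambda^*(x) = -\inf_{x \in G} \Lambda^*(x) .$$

The large deviations principle now follows by combining (2.2.46) and (2.2.47). The claim (d) is a direct
consequence of parts (b) and (c) in Lemma 2.2.2 below. □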

Remarks:

(a). The crucial step in the above proof of the upper bound is the exponential form of Markov’s
inequality combined with the independence assumption by which one can decompose the bounding
exponent. For weakly dependent random variables a similar approach is to use the logarithmic
limit of the left side of (2.2.50) instead of the logarithmic moment generating function for a single
random variable. This is further explored in Section 2.3.

(b). The crucial step in the above proof of the lower bound is the exponential change of measure
applied when defining the associated measure $\tilde\mu$. This is particularly well suited to problems
where, even if the random variables involved are not directly independent, some form of underlying
independence exists (e.g., when a Girsanov type formula can be used, as in Chapter ??). Unless
coupled with some convexity, this argument may in general fail in the dependent case. For more
about it, cf. Section ??.

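The following lemma lifts some of the mystery behind the condition $\mathcal{D}_{\Lambda^*} \subseteq \bar{\mathcal{F}}$.

Lemma 2.2.2

(a). $\mathcal{F} \subset \mathcal{D}_{\Lambda^*}$ and both are intervals. Also $0 \in \mathcal{D}_\Lambda$ which is an interval.

(b). If $0 \in \mathcal{D}_\Lambda^o$ and $\Lambda$ is a steep function then $\mathcal{D}_{\Lambda^*} \subset \bar{\mathcal{F}}$.

(c). If $0 \in \mathcal{D}_\Lambda^o$ then $\Lambda^*$ is a good rate function.

Remark: Both the steepness of $\Lambda$ and the condition $0 \in \mathcal{D}_\Lambda^o$ are necessary for $\mathcal{D}_{\Lambda^*} \subseteq \bar{\mathcal{F}}$ (for details
see exercise 2.2.3).

Proof of Lemma 2.2.1: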

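(a). The convexity of $\Lambda$ follows by Holder’s inequality, as for any $\theta \in [0, 1]$

$$\Lambda(\theta\lambda_1 + (1-\theta)\lambda_2) = \log E[(e^{\lambda_1 X_1})^{\theta} (e^{\lambda_2 X_1})^{(1-\theta)}] \le \log\{E[e^{\lambda_1 X_1}]^{\theta}\, E[e^{\lambda_2 X_1}]^{(1-\theta)}\} = \theta\Lambda(\lambda_1) + (1-\theta)\Lambda(\lambda_2) .$$

Clearly $\Lambda(0) = \log E[1] = 0$, so $\Lambda^*(x) \ge 0x - \Lambda(0) = 0$. Let $x_n \to x$. Then for any $\lambda \in \mathbb{R}$,

$$\liminf_{x_n \to x} \Lambda^*(x_n) \ge \liminf_{x_n \to x} [\lambda x_n - \Lambda(\lambda)] = \lambda x - \Lambda(\lambda)$$

and thus

$$\liminf_{x_n \to x} \Lambda^*(x_n) \ge \sup_{\lambda \in \mathbb{R}} [\lambda x - \Lambda(\lambda)] = \Lambda^*(x) ,$$

establishing the lower semicontinuity of $\Lambda^*$. Thus, $\Lambda^*$ is a rate function.
The convexity of $\Lambda^*$ follows by definition as

$$\theta\Lambda^*(x_1) + (1-\theta)\Lambda^*(x_2) = \sup_{\lambda \in \mathbb{R}} \{\theta\lambda x_1 - \theta\Lambda(\lambda)\} + \sup_{\lambda \in \mathbb{R}} \{(1-\theta)\lambda x_2 - (1-\theta)\Lambda(\lambda)\}$$

$$\ge \sup_{\lambda \in \mathbb{R}} \{(\theta x_1 + (1-\theta)x_2)\lambda - \Lambda(\lambda)\} = \Lambda^*(\theta x_1 + (1-\theta)x_2) .$$

(b). By Jensen’s inequality,

$$\Lambda(\lambda) = \log E[e^{\lambda X_1}] \ge E[\log e^{\lambda X_1}] = \lambda\bar x ,$$

for any $\lambda \in \mathbb{R}$ and thus $\Lambda^*(\bar x) = 0$ (this should have been expected in view of (2.2.39)).

Suppose now that $x \in [\bar x, \infty)$. Then, for any $\lambda < 0$

$$\lambda x - \Lambda(\lambda) \le \lambda\bar x - \Lambda(\lambda) \le \Lambda^*(\bar x) = 0 ,$$

and (2.2.42) follows since $\Lambda^*(x)$ is non-negative. Moreover, $\Lambda^*(x)$ is nondecreasing for $x \ge \bar x$ since
for any $\lambda \ge 0$ the function $g(x) = \lambda x - \Lambda(\lambda)$ is nondecreasing. The proof of (2.2.43) for $x \in (-\infty, \bar x]$
is similar.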

(c). The identity (2.2.44) follows by interchanging the order of differentiation and integration. This
is justified by the dominated convergence theorem since for $\epsilon$ small enough

$$\frac{1}{\epsilon}\,\big(M(\eta + \epsilon) + M(\eta - \epsilon)\big) < \infty .$$

Let $\Lambda'(\eta) = y$ and consider the function $g(\lambda) = \lambda y - \Lambda(\lambda)$. As $g(\cdot)$ is a concave function and
$g'(\eta) = 0$ clearly $g(\eta) = \sup_{\lambda \in \mathbb{R}} g(\lambda)$ and (2.2.45) is established. □
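Proof of Lemma 2.2.2:

(a). The sets $\mathcal{D}_\Lambda$ and $\mathcal{D}_{\Lambda^*}$ are convex since the functions $\Lambda$ and $\Lambda^*$ are convex. Further, $\mathcal{D}_\Lambda$ and
$\mathcal{D}_{\Lambda^*}$ are intervals since any convex subset of $\mathbb{R}^1$ is an interval. It was already noted that $\Lambda(0) = 0$,
so $0 \in \mathcal{D}_\Lambda$. Moreover, $\Lambda'(\cdot)$ is nondecreasing since $\Lambda(\cdot)$ is convex. Now, since $\mathcal{D}_\Lambda^o$ is an interval so
is $\mathcal{F}$. Finally, $\mathcal{F} \subset \mathcal{D}_{\Lambda^*}$ by (2.2.45).

(b). Since $0 \in \mathcal{D}_\Lambda^o$ certainly $\bar x = \Lambda'(0) \in \mathcal{F}$ and if $\mathcal{D}_{\Lambda^*} = \{\bar x\}$ the proof is completed. Otherwise,
$\mathcal{D}_{\Lambda^*}^o$ is a non-empty interval. Consider now some $x > \bar x$ and suppose that $x \in \mathcal{D}_{\Lambda^*}^o$. Recall that
$\Lambda^*(x) = \sup_{\lambda \in \mathcal{D}_\Lambda^+} g(\lambda)$ where the concave function $g(\lambda) = \lambda x - \Lambda(\lambda)$ is continuous within the
interval $\mathcal{D}_\Lambda^+ = \mathcal{D}_\Lambda \cap [0, \infty)$ and differentiable in the non-empty interior of $\mathcal{D}_\Lambda^+$. Consequently,
$\Lambda^*(x) = \lim_{r\to\infty} g(\lambda_r)$ for some positive sequence $\lambda_r \in \mathcal{D}_\Lambda^o$ such that $g'(\lambda_r) \ge 0$. Further, the
sequence $\{\lambda_r\}$ is bounded since $\Lambda^*(x + \epsilon) < \infty$ for some $\epsilon > 0$, and as such has a limit point,
say $\lambda^*$. Passing to the convergent subsequence, $\lambda^* \notin \mathcal{D}_\Lambda^o$ implies by the steepness of $\Lambda(\cdot)$ that
$\lim_{r\to\infty} g'(\lambda_r) = -\infty$. This contradicts the above requirement that $g'(\lambda_r) \ge 0$. Therefore, $\lambda^* \in \mathcal{D}_\Lambda^o$
implying $x = \Lambda'(\lambda^*)$, i.e., $x \in \mathcal{F}$. A similar proof applies for $x < \bar x$ so $\mathcal{D}_{\Lambda^*}^o \subset \mathcal{F}$. Since $\mathcal{D}_{\Lambda^*}$ is an
interval of non-empty interior it then follows that $\mathcal{D}_{\Lambda^*} \subset \bar{\mathcal{F}}$.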

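(c). If $0 \in \mathcal{D}_\Lambda^o$ then there exist $\lambda_+ > 0$ and $\lambda_- < 0$ which are both in $\mathcal{D}_\Lambda$. Since for any $\lambda \in \mathbb{R}^1$

$$\frac{\Lambda^*(x)}{|x|} \ge \lambda\,\mathrm{sign}(x) - \frac{\Lambda(\lambda)}{|x|}$$

it follows that

$$\liminf_{|x|\to\infty} \frac{\Lambda^*(x)}{|x|} \ge \min\{\lambda_+, -\lambda_-\} > 0 .$$

Thus, in particular $\Lambda^*(x) \to \infty$ as $|x| \to \infty$ and its level sets are bounded. Recall that closed and bounded
subsets of $\mathbb{R}^1$ are compact, so $\Lambda^*$ is indeed a good rate function. □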


Exercises: 


2.2.1  Suppose  that  Pa  =  IR^  Prove  that  /i({x})  =  when  x  G  Pa*  \ 

Hint:  Show  first  that  there  exists  {Xg}  such  that  lA^I  co  and 


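2.2.2 Explain why the above proof of (2.2.51) may not work when $x \in \mathcal{D}_{\Lambda^*} \setminus \bar{\mathcal{F}}$.
Hint: Try distributions with density of the form $Ce^{-|x|}/(1 + |x|^p)$ for appropriate $p$.

2.2.3 (a). Prove that if $\Lambda(\lambda) = \infty$ for all $\lambda > 0$ and $\bar x < \infty$, then $\Lambda^*(x) = 0$ for any $x > \bar x$, while
$\mathcal{F} \subset (-\infty, \bar x]$.

(b). Suppose now that $\mathcal{D}_\Lambda = (-\infty, \lambda_0]$ where $\lambda_0 > 0$ and $\lim_{\lambda \uparrow \lambda_0} \Lambda'(\lambda) = x_0 < \infty$. Prove that
$\Lambda^*(x) = \Lambda^*(x_0) + \lambda_0(x - x_0)$ for any $x \ge x_0$ while $\mathcal{F} \subset (-\infty, x_0]$.

2.2.4 Prove that $\Lambda(\cdot)$ is lower semicontinuous.

Hint: Suppose that $M(\lambda) = \infty$ for some $\lambda > 0$ while $\lambda - \delta \in \mathcal{D}_\Lambda$ for all $\delta > 0$. Let $dG(x) =
e^{\lambda x} d\mu(x)$ and observe that $\lim_{\delta \to 0} G(\frac{1}{\delta}) = \infty$. By integration by parts show that $M(\lambda - \delta) \ge
e^{-1} G(\frac{1}{\delta}) \to \infty$. Conclude the proof using the convexity of $\Lambda(\cdot)$.

2.2.5 (a). Prove that $\Lambda(\lambda)$ is $C^\infty$ in $\mathcal{D}_\Lambda^o$ and that $\Lambda^*(x)$ is strictly convex and $C^\infty$ in $\mathcal{F}^o$.
Hint: Show that $x = \Lambda'(\eta) \in \mathcal{F}^o$ implies that $\Lambda''(\eta) > 0$.

(b). Construct an example where $\mathcal{D}_\Lambda = \mathbb{R}^1$ while $\Lambda^*(\cdot)$ is discontinuous.

Hint: Use a binary valued random variable.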

2.2.6 Show that

(a) For $X_1$ a Poisson($\theta$) random variable $\Lambda^*(x) = \theta - x + x\log(\frac{x}{\theta})$ for $x \ge 0$ and $\Lambda^*(x) = \infty$
otherwise.

(b) For $X_1$ a Bernoulli($p$) random variable $\Lambda^*(x) = x\log(\frac{x}{p}) + (1 - x)\log(\frac{1-x}{1-p})$ for $x \in [0, 1]$ and
$\Lambda^*(x) = \infty$ otherwise.

(c) For $X_1$ an Exponential($\theta$) random variable $\Lambda^*(x) = \theta x - 1 - \log(\theta x)$ for $x > 0$ and $\Lambda^*(x) = \infty$
otherwise.

(d) For $X_1$ a Normal($0, \sigma^2$) random variable $\Lambda^*(x) = \frac{x^2}{2\sigma^2}$.

Verify that a large deviations principle holds in all these cases.

2.2.7 Let $(x_-, x_+)$ be the largest interval around $\bar x$ such that $(x_-, x_+) \cap A = \emptyset$. Define
$I_A = \min\{\Lambda^*(x_-), \Lambda^*(x_+)\}$ and $x^*$ any point at which this minimum is obtained.

(a). Prove that

$$\lim_{n\to\infty} \frac{1}{n}\log \mu_n(A) = -I_A \qquad (2.2.54)$$

whenever $I_A = \infty$.

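(b). Prove that (2.2.54) also holds for $I_A < \infty$ if $x^* \in \overline{A^o \cap \mathcal{F}}$.

Hint: Let $x_r \in A^o \cap \mathcal{F}$ be a sequence which converges to $x^*$ and apply (2.2.51) in order to verify
that $\liminf_{n\to\infty} \frac{1}{n}\log \mu_n(A) \ge -\lim_{r\to\infty} \Lambda^*(x_r) = -\Lambda^*(x^*)$.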


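2.2.8 Assume that $a \le X_1 \le b$.

(a) Show that for any $\lambda$

$$M(\lambda) \le \frac{b - \bar x}{b - a}\, e^{\lambda a} + \frac{\bar x - a}{b - a}\, e^{\lambda b} .$$

This is the main ingredient of Hoeffding's inequality.

(b) Use (a) to show that for any $a \le x \le b$

$$\Lambda^*(x) \ge H\!\left(\frac{x - a}{b - a}\,\Big|\,\frac{\bar x - a}{b - a}\right) , \qquad (2.2.55)$$

where $H(p|p_0) = p\log(p/p_0) + (1 - p)\log((1 - p)/(1 - p_0))$.

(c) Prove that the inequality (2.2.55) is sharp.

2.2.9 Assume that $\bar x = E[X_1]$ is finite, $X_1 \le b$ and $\mathrm{Var}(X_1) \le \sigma^2$.

(a) Show that for any $\lambda \ge 0$

$$M(\lambda) \le e^{\lambda\bar x} \left\{ \frac{(b - \bar x)^2}{(b - \bar x)^2 + \sigma^2}\, e^{-\lambda\sigma^2/(b - \bar x)} + \frac{\sigma^2}{(b - \bar x)^2 + \sigma^2}\, e^{\lambda(b - \bar x)} \right\} .$$

This is the main ingredient of Bennet’s inequality.

(b) Use (a) to show that

$$\Lambda^*(x) \ge H(p_x | p_{\bar x}) , \qquad (2.2.56)$$

for any $\bar x \le x \le b$ where $p_x = \frac{(b - \bar x)(x - \bar x) + \sigma^2}{(b - \bar x)^2 + \sigma^2}$.

(c) Prove that the inequality (2.2.56) is sharp.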
2.3 A general large deviations principle in $\mathbb{R}^d$

Cramer’s theorem 2.2.1 possesses a multivariate counterpart dealing with the large deviations of
the empirical means of i.i.d. real vectors in $\mathbb{R}^d$. Actually, the essential elements in the proof of this
theorem extend to a more general class of dependent random vectors. This is explored here, where
emphasis is put on the new points in which $\mathbb{R}^d$ differs from $\mathbb{R}^1$ with an eye towards possible infinite
dimensional extensions.

Key applications are of course Cramer’s theorem in $\mathbb{R}^d$ which is presented in Theorem 2.3.2
and its consequence, Sanov’s Theorem for finite alphabets (see Corollary 2.3.1 and exercise 2.3.1).
Some simple non i.i.d. applications are left as exercises 2.3.6, 2.3.8 and 2.3.9 while Section 2.4
is devoted to another class of key applications, the large deviations of Markov chains over finite
alphabets.

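The set-up considered here consists of a sequence of random vectors $Z_n \in \mathbb{R}^d$ with laws $\mu_n$ and
logarithmic moment generating functions

$$\Lambda_n(\lambda) = \log E[e^{\langle \lambda, Z_n \rangle}] \qquad (2.3.57)$$

where $\langle \lambda, x \rangle = \sum_{j=1}^{d} \lambda^j x^j$ is the usual scalar product in $\mathbb{R}^d$ with $x^j$ the $j$-th coordinate of the
point $x \in \mathbb{R}^d$ and $|x| = \sqrt{\langle x, x \rangle}$ the usual Euclidean norm.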

The existence of the limit of the properly scaled logarithmic moment generating functions
indicates that $\mu_n$ may satisfy a large deviations principle. Specifically, the following assumption
prevails throughout this section.

Assumption 2.3.1 There exists a sequence of constants $a_n \to 0$ as $n \to \infty$ such that for each $\lambda \in \mathbb{R}^d$,
the limit

$$\Lambda(\lambda) = \lim_{n\to\infty} a_n \Lambda_n(a_n^{-1}\lambda) \qquad (2.3.58)$$

exists (possibly as a point in $[-\infty, \infty]$). Further, $0 \in \mathcal{D}_\Lambda^o$ where $\mathcal{D}_\Lambda$ is the domain in which $\Lambda(\cdot) < \infty$.

For example, when $\mu_n$ is the law governing the empirical mean $\hat S_n$ of the i.i.d. random vectors
$X_i \in \mathbb{R}^d$ then for any $n \in \mathbb{Z}^+$

$$\frac{1}{n}\Lambda_n(n\lambda) = \Lambda(\lambda) \triangleq \log E[e^{\langle \lambda, X_1 \rangle}] \qquad (2.3.59)$$

and the above assumption holds whenever $0 \in \mathcal{D}_\Lambda^o$.

Definition 2.3.1 The Fenchel-Legendre transform of $\Lambda(\lambda)$ is

$$\Lambda^*(x) = \sup_{\lambda \in \mathbb{R}^d} \big( \langle \lambda, x \rangle - \Lambda(\lambda) \big) , \qquad (2.3.60)$$

with $\mathcal{D}_{\Lambda^*}$ denoting the domain in which $\Lambda^*(\cdot) < \infty$.

Cramer’s theorem suggests that $\Lambda^*$ is the natural candidate rate function for governing the large
deviations principle associated with $\mu_n$. This is indeed proved in Theorem 2.3.1 under an additional
condition. The properties of the functions $\Lambda$ and $\Lambda^*$ which are needed for that purpose are
summarized in the following lemma whose proof is deferred to the end of this section.
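As a concrete instance of (2.3.60), consider (as an assumption made only for this illustration) a two-dimensional Gaussian vector with mean $m$ and covariance matrix $C$, for which $\Lambda(\lambda) = \langle \lambda, m \rangle + \frac{1}{2}\langle \lambda, C\lambda \rangle$. The supremum is then attained at $\lambda = C^{-1}(x - m)$ and equals $\frac{1}{2}\langle x - m, C^{-1}(x - m) \rangle$. The sketch below, with hypothetical function names, evaluates both sides.

```python
def gaussian_log_mgf(lam, mean, cov):
    """Lambda(lam) = <lam, mean> + 0.5 <lam, cov lam> for a 2-d Gaussian."""
    c_lam = (
        cov[0][0] * lam[0] + cov[0][1] * lam[1],
        cov[1][0] * lam[0] + cov[1][1] * lam[1],
    )
    linear = lam[0] * mean[0] + lam[1] * mean[1]
    return linear + 0.5 * (lam[0] * c_lam[0] + lam[1] * c_lam[1])


def gaussian_rate(x, mean, cov):
    """Return (Lambda*(x), maximizer), with maximizer = cov^{-1} (x - mean)."""
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    d = (x[0] - mean[0], x[1] - mean[1])
    # Explicit inverse of the 2x2 covariance matrix applied to d.
    lam = (
        (cov[1][1] * d[0] - cov[0][1] * d[1]) / det,
        (-cov[1][0] * d[0] + cov[0][0] * d[1]) / det,
    )
    value = lam[0] * x[0] + lam[1] * x[1] - gaussian_log_mgf(lam, mean, cov)
    return value, lam
```

Evaluating $\langle \lambda, x \rangle - \Lambda(\lambda)$ on a grid of other $\lambda$ values should never exceed the value returned by `gaussian_rate`, which is a quick way to see the supremum in the definition at work.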

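Lemma 2.3.1 Assume 2.3.1.

(a). $\Lambda(\lambda)$ is a convex function, $\Lambda(\lambda) > -\infty$ everywhere and $\Lambda^*(x)$ is a good, convex rate function.

(b). $\mathcal{D}_\Lambda$ and $\mathcal{D}_{\Lambda^*}$ are convex sets.

(c). Suppose that $y = \nabla\Lambda(\eta)$ for some $\eta \in \mathcal{D}_\Lambda^o$. Then

$$\Lambda^*(y) = \langle \eta, y \rangle - \Lambda(\eta) . \qquad (2.3.61)$$

(d). Let $y, \eta$ be as in (c) above. Then, for any $x \ne y$

$$\Lambda_\eta^*(x) > \Lambda_\eta^*(y) = 0 , \qquad (2.3.62)$$

where $\Lambda_\eta^*(\cdot)$ is the Fenchel-Legendre transform of

$$\Lambda_\eta(\theta) = \Lambda(\theta + \eta) - \Lambda(\eta) . \qquad (2.3.63)$$

Remark: These convex analysis considerations are addressed again in Section ?? in a more abstract
setup.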

The general large deviations principle in $\mathbb{R}^d$ is now summarized as follows.

Theorem 2.3.1 (Gartner) Assume 2.3.1. Then

(a). For any closed set $F$

$$\limsup_{n\to\infty} a_n \log \mu_n(F) \le -\inf_{x \in F} \Lambda^*(x) . \qquad (2.3.64)$$

(b). For any open set $G$

$$\liminf_{n\to\infty} a_n \log \mu_n(G) \ge -\inf_{x \in G \cap \bar{\mathcal{F}}} \Lambda^*(x) , \qquad (2.3.65)$$

where $\mathcal{F} = \{x : x = \nabla\Lambda(\lambda) \text{ for some } \lambda \in \mathcal{D}_\Lambda^o\}$.

(c). If $\mathcal{D}_{\Lambda^*} \subset \bar{\mathcal{F}}$ then the family of probability measures $\mu_n$ satisfies a large deviations principle
controlled by the good rate function $\Lambda^*(x)$.

The following lemma, whose proof is also deferred to the end of this section, makes the above
theorem applicable by stating explicit conditions on $\Lambda$ under which $\mathcal{D}_{\Lambda^*} \subseteq \bar{\mathcal{F}}$.

Lemma 2.3.2 (Ellis) Suppose that $\Lambda(\lambda)$ which satisfies 2.3.1 is a lower semicontinuous function
which is differentiable in $\mathcal{D}_\Lambda^o$ and moreover it is a steep function, namely for any convergent
sequence $\lambda_n \in \mathcal{D}_\Lambda^o$ whose limit does not belong to $\mathcal{D}_\Lambda^o$, $\lim_{n\to\infty} |\nabla\Lambda(\lambda_n)| = \infty$. Then, $\mathcal{D}_{\Lambda^*} \subset \bar{\mathcal{F}}$.

The proof of Theorem 2.3.1 is given in the sequel, preceded by the statement and derivation of
a key application: Cramer's theorem about the empirical means of i.i.d. random vectors in $\mathbb{R}^d$.

Theorem 2.3.2 Let $\mu_n$ be the laws governing the empirical means $\hat S_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ where $X_i \in \mathbb{R}^d$
are i.i.d. distributed according to the law $\mu$. Suppose that the logarithmic moment generating
function

$$\Lambda(\lambda) = \log E_\mu[e^{\langle \lambda, X_1 \rangle}] , \qquad (2.3.66)$$

is a steep lower semicontinuous function which is finite in some open ball centered at the origin (in
particular these conditions hold when $\mathcal{D}_\Lambda = \mathbb{R}^d$). Then, for any measurable set $\Gamma \subset \mathbb{R}^d$

$$-\inf_{x \in \Gamma^o} \Lambda^*(x) \le \liminf_{n\to\infty} \frac{1}{n}\log \mu_n(\Gamma) \le \limsup_{n\to\infty} \frac{1}{n}\log \mu_n(\Gamma) \le -\inf_{x \in \bar\Gamma} \Lambda^*(x) , \qquad (2.3.67)$$

with $\Lambda^*$ being the Fenchel-Legendre transform of $\Lambda$.

Proof: Recall that in this case the basic assumption 2.3.1 holds (see (2.3.59)) and indeed $\Lambda$ of
(2.3.58) is given by (2.3.66). In the statement of the theorem it is further assumed that $\Lambda$ is a steep
function. It follows from the definition (2.3.66) that $\Lambda$ is differentiable in $\mathcal{D}_\Lambda^o$ (by an argument
which is similar to the proof of (2.2.44) in Section 2.2). Thus, Lemma 2.3.2 applies and (2.3.67)
follows by Theorem 2.3.1. □

Remark: The assumption of lower semicontinuity may be relaxed. For details, cf. exercise
2.3.3.


The  essential  ingredients  for  the  proof  of  Theorem  2.3.1  are  those  presented  in  Section  2.2  in  the 
course  of  proving  Theorem  2.2.1,  namely  -  the  exponential  form  of  Markov’s  inequality  is  applied 
for  deriving  the  upper  bound  and  an  exponential  change  of  measure  is  used  for  deriving  the  lower 
bound.  However,  here  one  encounters  two  new  obstacles  which  slightly  complicate  both  parts  of 
the  proof. 

Proof  of  Theorem  2.3.1: 

(a). In $\mathbb{R}^d$ the monotonicity of $\Lambda^*$ stated in Lemma 2.2.1 part (b) is somewhat lost. Thus, the strategy of containing $A$ by two half-spaces $A_-$ and $A_+$ is not as useful as it is in $\mathbb{R}$. Instead, here one uses Markov's inequality to obtain tight upper bounds for all small closed balls. Then, compact sets are covered by an appropriate finite collection of small enough balls and the upper bound follows for compact sets by the union of events bound.

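As mentioned in Section 1.1, proving (2.3.64) is equivalent to proving that for any $\delta > 0$ and any closed set $F \subset \mathbb{R}^d$

$$\limsup_{n\to\infty} a_n \log\mu_n(F) \le \delta - \inf_{x\in F} I^\delta(x) \tag{2.3.68}$$

where

$$I^\delta(x) \triangleq \begin{cases} \Lambda^*(x) - \delta & x \in \mathcal{D}_{\Lambda^*} \\ 1/\delta & x \notin \mathcal{D}_{\Lambda^*}. \end{cases}$$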

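Fix now $\delta > 0$ and an arbitrary compact set $\Gamma$. For any $q \in \Gamma$ choose $\lambda_q \in \mathcal{D}_\Lambda$ for which

$$\langle\lambda_q, q\rangle - \Lambda(\lambda_q) \ge I^\delta(q). \tag{2.3.69}$$

This is always possible by the definitions of $\Lambda^*$ and $I^\delta$. Choose now $\rho_q > 0$ such that $\rho_q|\lambda_q| \le \delta$ and let $B_{q,\rho_q}$ be the open ball with center at the point $q$ and radius $\rho_q$, with $\bar B_{q,\rho_q}$ the corresponding closed ball.

Then, for any $n$ and any $q \in \Gamma$

$$\mu_n(\bar B_{q,\rho_q}) \le \exp\Big(-\inf_{x\in\bar B_{q,\rho_q}} a_n^{-1}\langle\lambda_q, x\rangle\Big)\, E\big[\exp(a_n^{-1}\langle\lambda_q, Z_n\rangle)\big].$$

Thus,

$$a_n\log\mu_n(\bar B_{q,\rho_q}) \le -\inf_{x\in\bar B_{q,\rho_q}}\langle\lambda_q, x\rangle + a_n\Lambda_n(a_n^{-1}\lambda_q) \le \delta - \langle\lambda_q, q\rangle + a_n\Lambda_n(a_n^{-1}\lambda_q). \tag{2.3.70}$$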




As $\Gamma$ is a compact set, one can extract from the open cover $\bigcup_{q\in\Gamma} B_{q,\rho_q}$ of $\Gamma$ a finite cover which consists of $N < \infty$ (depending only on $\Gamma$ and $\delta$) such balls with centers $q_1, \ldots, q_N$ in $\Gamma$. By the union of events bound and (2.3.70),

$$a_n\log\mu_n(\Gamma) \le a_n\log N + \delta - \min_{i=1,\ldots,N}\{\langle\lambda_{q_i}, q_i\rangle - a_n\Lambda_n(a_n^{-1}\lambda_{q_i})\}.$$

Since $a_n \to 0$ as $n \to \infty$ while $a_n\Lambda_n(a_n^{-1}\lambda_{q_i}) \to \Lambda(\lambda_{q_i})$ (uniformly over $i = 1, \ldots, N$), one obtains

$$\limsup_{n\to\infty} a_n\log\mu_n(\Gamma) \le \delta - \min_{i=1,\ldots,N}\{\langle\lambda_{q_i}, q_i\rangle - \Lambda(\lambda_{q_i})\} \le \delta - \min_{i=1,\ldots,N} I^\delta(q_i),$$

where the last inequality follows from (2.3.69). As $q_i \in \Gamma$ the upper bound (2.3.68) is thus established for all compact sets.

This upper bound is extended to all closed sets in $\mathbb{R}^d$ by showing that $\mu_n$ is an exponentially tight family of probability measures and using Lemma 1.1.1. Specifically, it is shown in the sequel that for any $\alpha < \infty$ there exists $\rho_\alpha$ large enough such that for the compact set $K_\alpha = \bar B_{0,\rho_\alpha}$

$$\limsup_{n\to\infty} a_n\log\mu_n(K_\alpha^c) \le -\alpha. \tag{2.3.71}$$

For that purpose, observe first that $\bar B^c_{0,\rho d} \subset \bigcup_{j=1}^d \{z : |z^j| > \rho\}$. Therefore, by the union of events bound

$$\mu_n(\bar B^c_{0,\rho d}) \le \mu_n\Big(\bigcup_{j=1}^d \{z : |z^j| > \rho\}\Big) \le \sum_{j=1}^d \big[\mu_n^j([\rho,\infty)) + \mu_n^j((-\infty,-\rho])\big] \tag{2.3.72}$$

where $\mu_n^j$ are the laws of $Z_n^j$, the coordinates of the random vector $Z_n$. As $\mathcal{D}_\Lambda$ contains an open ball around the origin there exist $\theta_j^+ > 0$ and $\theta_j^- < 0$ such that $\Lambda(\theta_j^+ u_j) < \infty$ and $\Lambda(\theta_j^- u_j) < \infty$ where $u_j$ denotes the $j$-th unit vector in $\mathbb{R}^d$ for $j = 1, \ldots, d$. By the exponential form of Markov's inequality, for $j = 1, \ldots, d$

$$\limsup_{n\to\infty} a_n\log\mu_n^j([\rho,\infty)) \le -\theta_j^+\rho + \limsup_{n\to\infty} a_n\Lambda_n(a_n^{-1}\theta_j^+ u_j) = -\theta_j^+\rho + \Lambda(\theta_j^+ u_j). \tag{2.3.73}$$

Similarly,

$$\limsup_{n\to\infty} a_n\log\mu_n^j((-\infty,-\rho]) \le \theta_j^-\rho + \Lambda(\theta_j^- u_j). \tag{2.3.74}$$

Now, the inequality (2.3.71) results by combining (2.3.72), (2.3.73) and (2.3.74) and considering $\rho \to \infty$.

(b). Focusing now on establishing the lower bound (2.3.65) for any open set, it suffices to prove

$$\lim_{\delta\to 0}\liminf_{n\to\infty} a_n\log\mu_n(B_{y,\delta}) \ge -\Lambda^*(y), \tag{2.3.75}$$

for any $y \in \mathcal{F}$. Indeed, (2.3.75) then holds for any $y \in G \cap \mathcal{F}$ and (2.3.65) follows by the same argument, encountered in $\mathbb{R}$, that leads from (2.2.47) to (2.2.51) in the proof of Theorem 2.2.1.

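Fix now $y = \nabla\Lambda(\eta) \in \mathcal{F}$ with $\eta \in \mathcal{D}_\Lambda^o$. Then, for all $n$ large enough, $\Lambda_n(a_n^{-1}\eta) < \infty$ and the "associated" probability measures

$$d\tilde\mu_n(z) = \exp\big[a_n^{-1}\langle\eta, z\rangle - \Lambda_n(a_n^{-1}\eta)\big]\, d\mu_n(z) \tag{2.3.76}$$

are well defined. Clearly,

$$a_n\log\mu_n(B_{y,\delta}) = a_n\Lambda_n(a_n^{-1}\eta) - \langle\eta, y\rangle + a_n\log\int_{z\in B_{y,\delta}} \exp\big(a_n^{-1}\langle\eta, y - z\rangle\big)\, d\tilde\mu_n(z)$$
$$\ge a_n\Lambda_n(a_n^{-1}\eta) - \langle\eta, y\rangle - |\eta|\delta + a_n\log\tilde\mu_n(B_{y,\delta}). \tag{2.3.77}$$

Therefore,

$$\lim_{\delta\to 0}\liminf_{n\to\infty} a_n\log\mu_n(B_{y,\delta}) \ge \Lambda(\eta) - \langle\eta, y\rangle + \lim_{\delta\to 0}\liminf_{n\to\infty} a_n\log\tilde\mu_n(B_{y,\delta})$$
$$= -\Lambda^*(y) + \lim_{\delta\to 0}\liminf_{n\to\infty} a_n\log\tilde\mu_n(B_{y,\delta}) \tag{2.3.78}$$

where the above equality follows by (2.3.61).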
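The exponential change of measure (2.3.76) can be illustrated on a toy example. The sketch below assumes a one-dimensional law supported on finitely many points and takes $a_n = 1$; the data and function names are hypothetical. It checks that the tilted law is a probability measure whose mean equals $\Lambda'(\eta)$, which is why the tilted measures concentrate near $y = \nabla\Lambda(\eta)$, and it evaluates $\langle\eta, y\rangle - \Lambda(\eta)$ as in (2.3.61).

```python
import math

def log_mgf(points, probs, lam):
    """Lambda(lam) = log E[exp(lam * X)] for a law on finitely many real points."""
    return math.log(sum(p * math.exp(lam * x) for x, p in zip(points, probs)))

def tilt(points, probs, eta):
    """Weights of the tilted law with density exp(eta*x - Lambda(eta)) w.r.t. the original."""
    lam_eta = log_mgf(points, probs, eta)
    return [p * math.exp(eta * x - lam_eta) for x, p in zip(points, probs)]

def mean(points, probs):
    """Mean of a law on finitely many real points."""
    return sum(x * p for x, p in zip(points, probs))

if __name__ == "__main__":
    points, probs, eta = [-1.0, 0.0, 2.0], [0.3, 0.5, 0.2], 0.8
    tilted = tilt(points, probs, eta)
    y = mean(points, tilted)  # mean under the tilted law
    h = 1e-6                  # step for a central-difference estimate of Lambda'(eta)
    dlam = (log_mgf(points, probs, eta + h) - log_mgf(points, probs, eta - h)) / (2 * h)
    print("total mass:", sum(tilted))
    print("tilted mean:", y, "numerical derivative:", dlam)
    print("<eta, y> - Lambda(eta):", eta * y - log_mgf(points, probs, eta))
```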

Here,  a  new  obstacle  stems  from  the  removal  of  the  independence  assumption.  Indeed,  V 

as $n \to \infty$, but now one still has to establish the appropriate analog of the weak law of large numbers. This is handled in the sequel by applying the large deviations upper bound for the "associated" family of measures $\tilde\mu_n$. Indeed, the proof of (2.3.75) is completed by showing that for any $\delta > 0$

$$\limsup_{n\to\infty} a_n\log\tilde\mu_n(B^c_{y,\delta}) < 0. \tag{2.3.79}$$


For that purpose let $\Lambda_{n,\eta}(\cdot)$ denote the logarithmic moment generating function corresponding to the law $\tilde\mu_n$. Then, for any $\theta \in \mathbb{R}^d$

$$a_n\Lambda_{n,\eta}(a_n^{-1}\theta) = a_n\log\int e^{a_n^{-1}\langle\theta, z\rangle}\, d\tilde\mu_n(z) = a_n\Lambda_n(a_n^{-1}(\theta + \eta)) - a_n\Lambda_n(a_n^{-1}\eta) \to \Lambda_\eta(\theta)$$

as $a_n\Lambda_n(a_n^{-1}\eta) < \infty$ for all $n$ large (recall that $\eta \in \mathcal{D}_\Lambda^o$). Moreover, $\Lambda_\eta(0) = 0$ and $\Lambda_\eta(\theta) < \infty$ for all $|\theta|$ small enough since $\eta \in \mathcal{D}_\Lambda^o$ (c.f. (2.3.63)). Thus, a large deviations upper bound of the form of (2.3.64) holds for the sequence of measures $\tilde\mu_n$. In particular, for the closed set $B^c_{y,\delta}$ it yields

$$\limsup_{n\to\infty} a_n\log\tilde\mu_n(B^c_{y,\delta}) \le -\inf_{x\in B^c_{y,\delta}}\Lambda_\eta^*(x). \tag{2.3.80}$$

Moreover, $\Lambda_\eta^*(x) > 0$ for any $x \ne y$ in view of (2.3.62), and paralleling the proof of part (a) of Lemma 2.3.1 one easily shows that $\Lambda_\eta^*$ is a good rate function. Therefore, $\inf_{x\in B^c_{y,\delta}}\Lambda_\eta^*(x) > 0$ for all $\delta > 0$ and (2.3.80) implies (2.3.79), concluding the proof of the lower bound (2.3.65).

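(c). The large deviations principle follows by combining (2.3.64) and (2.3.65), as $\mathcal{D}_{\Lambda^*} \subset \mathcal{F}$ implies that for any $G$

$$-\inf_{x\in G\cap\mathcal{F}}\Lambda^*(x) \ge -\inf_{x\in G\cap\mathcal{D}_{\Lambda^*}}\Lambda^*(x) = -\inf_{x\in G}\Lambda^*(x).$$

□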

Remarks: 

(a). The proof above actually extends beyond $\mathbb{R}^d$ and in principle is applicable for any metric space (as shown in Chapter ??). However, two points of caution are that the exponential tightness has to be proved on a case by case basis and that in infinite dimensional spaces $\Lambda$ is rarely differentiable and therefore the analog of Lemma 2.3.2 typically fails to hold.

(b). In the current set-up the condition $0 \in \mathcal{D}_\Lambda^o$, which is needed for the above proof, does not follow
as  a  consequence  of  I?a*  ^  ^ ■  For  this  reason  it  is  incorporated  in  Assumption  2.3.1.  Note  however 
that the condition $0 \in \mathcal{D}_\Lambda^o$ is not required at all for proving the upper bound (2.3.64) for compact sets.

As mentioned in Section 2.1, Sanov's theorem for finite alphabets, Theorem 2.1.1, may be deduced as a consequence of Cramér's theorem 2.3.2. Indeed, note that the empirical mean of the random vectors $Y_i = [1_{X_i = a_1}, 1_{X_i = a_2}, \ldots, 1_{X_i = a_{|\Sigma|}}]$ equals the empirical measure of the i.i.d. random variables $X_1, \ldots, X_n$ over the finite alphabet $\Sigma$. Moreover, as $Y_i$ are bounded, here $\mathcal{D}_\Lambda = \mathbb{R}^{|\Sigma|}$ and one obtains the following corollary of Theorem 2.3.2.

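Corollary 2.3.1 For any set $\Gamma$ of probability vectors in $\mathbb{R}^{|\Sigma|}$

$$-\inf_{\nu\in\Gamma^o}\Lambda^*(\nu) \le \liminf_{n\to\infty}\frac{1}{n}\log\mathrm{Prob}_\mu(L_n^X \in \Gamma) \le \limsup_{n\to\infty}\frac{1}{n}\log\mathrm{Prob}_\mu(L_n^X \in \Gamma) \le -\inf_{\nu\in\bar\Gamma}\Lambda^*(\nu) \tag{2.3.81}$$

where $\Lambda^*$ is the Fenchel-Legendre transform of the logarithmic moment generating function

$$\Lambda(\lambda) = \log E\big[e^{\langle\lambda, Y_1\rangle}\big] = \log\sum_{i=1}^{|\Sigma|} e^{\lambda_i}\mu(a_i) \tag{2.3.82}$$

and $\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_{|\Sigma|}) \in \mathbb{R}^{|\Sigma|}$.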

Remark: Comparing the above corollary to Theorem 2.1.1 it is tempting to conjecture that $\Lambda^*(\cdot) = H(\cdot|\mu)$. Indeed, this is proved in exercise 2.3.1. Actually, as shown in Section ?? the rate function controlling a large deviations principle in $\mathbb{R}^d$ is always unique, thus this result is not surprising.
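The identity $\Lambda^*(\cdot) = H(\cdot|\mu)$ can be checked numerically on a small alphabet. The following is only a sanity check under assumed data (a three-letter alphabet, strictly positive $\mu$ and $\nu$, hypothetical function names), not a proof: it evaluates $\langle\lambda, \nu\rangle - \Lambda(\lambda)$ with $\Lambda$ as in (2.3.82) at the candidate maximizer $\lambda_i = \log(\nu(a_i)/\mu(a_i))$ and compares the value with the relative entropy and with randomly drawn $\lambda$.

```python
import math
import random

def log_mgf(lam, mu):
    """Lambda(lam) = log sum_i exp(lam_i) * mu(a_i), as in (2.3.82)."""
    return math.log(sum(math.exp(l) * m for l, m in zip(lam, mu)))

def relative_entropy(nu, mu):
    """H(nu | mu) = sum_i nu_i * log(nu_i / mu_i), with the convention 0 log 0 = 0."""
    return sum(n * math.log(n / m) for n, m in zip(nu, mu) if n > 0)

def dual_objective(lam, nu, mu):
    """<lam, nu> - Lambda(lam); its supremum over lam is Lambda*(nu)."""
    return sum(l * n for l, n in zip(lam, nu)) - log_mgf(lam, mu)

if __name__ == "__main__":
    mu = [0.2, 0.3, 0.5]
    nu = [0.6, 0.3, 0.1]
    eta = [math.log(n / m) for n, m in zip(nu, mu)]  # candidate maximizer
    print(dual_objective(eta, nu, mu), relative_entropy(nu, mu))
    rng = random.Random(0)
    best_random = max(dual_objective([rng.uniform(-5, 5) for _ in mu], nu, mu)
                      for _ in range(10000))
    print(best_random)  # stays below the value attained at eta
```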

Proof  of  Lemma  2.3.1: 

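(a). Since $\Lambda_n$ are convex functions (see the proof of part (a) of Lemma 2.2.1) so are $a_n\Lambda_n(a_n^{-1}\cdot)$ and their limit $\Lambda(\cdot)$ is convex as well. Moreover, $\Lambda_n(0) = 0$ and therefore $\Lambda(0) = 0$, implying that $\Lambda^*$ is non-negative. Both the convexity and the lower semicontinuity of $\Lambda^*$ follow from its definition via (2.3.60) (see proof of part (a) of Lemma 2.2.1).

Now if $\Lambda(\lambda) = -\infty$ for some $\lambda \in \mathbb{R}^d$ then by convexity $\Lambda(\alpha\lambda) = -\infty$ for all $\alpha \in (0,1]$. Moreover, since $\Lambda(0) = 0$ it follows by convexity that $\Lambda(-\alpha\lambda) = \infty$ for all $\alpha \in (0,1]$, contradicting the assumption that $0 \in \mathcal{D}_\Lambda^o$.

Since $0 \in \mathcal{D}_\Lambda^o$ it follows that $\bar B_{0,\delta} \subset \mathcal{D}_\Lambda^o$ for some $\delta > 0$ and $C = \sup_{\lambda\in\bar B_{0,\delta}}\Lambda(\lambda) < \infty$ since the convex function $\Lambda$ is continuous in $\mathcal{D}_\Lambda^o$. Therefore,

$$\Lambda^*(x) \ge \sup_{\lambda\in\bar B_{0,\delta}}\{\langle\lambda, x\rangle - \Lambda(\lambda)\} \ge \sup_{\lambda\in\bar B_{0,\delta}}\langle\lambda, x\rangle - \sup_{\lambda\in\bar B_{0,\delta}}\Lambda(\lambda) = \delta|x| - C. \tag{2.3.83}$$

Thus, the level sets $\{x : \Lambda^*(x) \le \alpha\}$ are clearly bounded within a closed ball around the origin (of radius $(\alpha + C)/\delta$) and $\Lambda^*$ is necessarily a good rate function.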

(b). The convexity of $\mathcal{D}_\Lambda$ and $\mathcal{D}_{\Lambda^*}$ is merely a consequence of the convexity of $\Lambda$ and $\Lambda^*$ respectively.

(c). Clearly $\Lambda^*(y) \ge [\langle\eta, y\rangle - \Lambda(\eta)]$. Assume that this inequality is strict, i.e., for some $\lambda \in \mathcal{D}_\Lambda$

$$[\langle\lambda, y\rangle - \Lambda(\lambda)] > [\langle\eta, y\rangle - \Lambda(\eta)]. \tag{2.3.84}$$

Since $\mathcal{D}_\Lambda$ is a convex set,

$$g(\alpha) = \alpha\langle\lambda - \eta, y\rangle - \Lambda(\eta + \alpha(\lambda - \eta)) + \langle\eta, y\rangle, \qquad \alpha \in [0,1] \tag{2.3.85}$$

is a finite valued concave function. Thus, by concavity

$$g(1) - g(0) \le \liminf_{\alpha\downarrow 0}\frac{g(\alpha) - g(0)}{\alpha} = \langle\lambda - \eta, y - \nabla\Lambda(\eta)\rangle = 0, \tag{2.3.86}$$

where the last equality follows from the assumption $y = \nabla\Lambda(\eta)$. However, the inequality above contradicts (2.3.84) and therefore (2.3.61) is established.

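(d). The function $\Lambda_\eta^*$ is non-negative (being the Fenchel-Legendre transform of $\Lambda_\eta$, where $\Lambda_\eta(0) = 0$) and moreover

$$\Lambda_\eta^*(x) = \sup_{\theta}\{\langle\theta, x\rangle - \Lambda(\theta + \eta)\} + \Lambda(\eta) = \Lambda^*(x) - \langle\eta, x\rangle + \Lambda(\eta). \tag{2.3.87}$$

Thus, $\Lambda_\eta^*(y) = 0$ by (2.3.61) and $\Lambda_\eta^*(x) \ge \Lambda_\eta^*(y)$ for any $x \ne y$. If $\Lambda_\eta^*(x) = 0$ for some $x$ then for any $\theta \in \mathbb{R}^d$

$$\langle\theta, x\rangle \le \Lambda_\eta(\theta). \tag{2.3.88}$$

In particular also

$$\langle\theta, x\rangle \le \lim_{n\to\infty}\frac{\Lambda_\eta(\theta/n)}{(1/n)} = \langle\theta, \nabla\Lambda_\eta(0)\rangle, \tag{2.3.89}$$

where $\nabla\Lambda_\eta(0) = \nabla\Lambda(\eta) = y$. Since the inequality (2.3.89) holds for all $\theta \in \mathbb{R}^d$, necessarily $x = \nabla\Lambda_\eta(0) = y$. Thus, $\Lambda_\eta^*(x) > 0$ for any $x \ne y$ and (2.3.62) is established. □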

Proof  of  Lemma  2.3.2: 

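Since $\mathcal{D}_\Lambda^o$ is not empty, by (2.3.61) so is $\mathcal{D}_{\Lambda^*}$. Further, if $\mathcal{D}_{\Lambda^*} = \{x\}$ then necessarily $x = \nabla\Lambda(0) \in \mathcal{F}$ and the proof is then complete. Therefore, we may assume from here on that the convex set $\mathcal{D}_{\Lambda^*}$ contains at least one line segment, so its relative interior

$$\mathrm{ri}\,\mathcal{D}_{\Lambda^*} = \{x \in \mathcal{D}_{\Lambda^*} : y \in \mathcal{D}_{\Lambda^*} \Rightarrow x - \epsilon(y - x) \in \mathcal{D}_{\Lambda^*} \text{ for some } \epsilon > 0\}$$

is non-empty and moreover

$$\overline{\mathcal{D}_{\Lambda^*}} = \overline{\mathrm{ri}\,\mathcal{D}_{\Lambda^*}}. \tag{2.3.90}$$

The subdifferential $\partial\Lambda(\lambda)$ of the convex function $\Lambda(\cdot)$ at a point $\lambda \in \mathbb{R}^d$ is defined as

$$\partial\Lambda(\lambda) = \{x : \Lambda(\theta) \ge \Lambda(\lambda) + \langle x, \theta - \lambda\rangle \ \forall\theta \in \mathbb{R}^d\} = \{x : \langle\lambda, x\rangle - \Lambda^*(x) \ge \Lambda(\lambda)\}, \tag{2.3.91}$$

where the second equality is a direct consequence of the definition of the Fenchel-Legendre transform. It is easy to check that $\partial\Lambda(\lambda) = \emptyset$ for $\lambda \notin \mathcal{D}_\Lambda$ (recall that $\Lambda > -\infty$ everywhere) while otherwise $\partial\Lambda(\eta) = \{x : \Lambda_\eta^*(x) = 0\}$ (where $\Lambda_\eta^*$ is defined in (2.3.63)). Since $\Lambda$ is differentiable throughout $\mathcal{D}_\Lambda^o$ (by assumption), it follows by (2.3.61) and (2.3.62) that $\partial\Lambda(\eta) = \{\nabla\Lambda(\eta)\}$ for any $\eta \in \mathcal{D}_\Lambda^o$.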

The  proof  of  the  lemma  is  now  divided  into  the  following  two  steps: 

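(a). Since $\Lambda$ is lower semicontinuous, for any $x \in \mathrm{ri}\,\mathcal{D}_{\Lambda^*}$ there exists $\lambda \in \mathbb{R}^d$ such that $x \in \partial\Lambda(\lambda)$.

(b). Since $\Lambda$ is a steep function, $\partial\Lambda(\lambda) = \emptyset$ for any $\lambda \in \mathcal{D}_\Lambda \setminus \mathcal{D}_\Lambda^o$.

When combined, these two claims result with

$$\mathrm{ri}\,\mathcal{D}_{\Lambda^*} \subset \bigcup_{\lambda\in\mathbb{R}^d}\partial\Lambda(\lambda) = \bigcup_{\lambda\in\mathcal{D}_\Lambda^o}\{\nabla\Lambda(\lambda)\} = \mathcal{F},$$

which, together with (2.3.90), amount to the proof of the lemma.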

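(a). Fix a point $x \in \mathrm{ri}\,\mathcal{D}_{\Lambda^*}$ and define the function

$$g(y) = \inf_{\delta > 0}\frac{\Lambda^*(x + \delta y) - \Lambda^*(x)}{\delta} \tag{2.3.92}$$
$$= \lim_{\delta\downarrow 0}\frac{\Lambda^*(x + \delta y) - \Lambda^*(x)}{\delta} \tag{2.3.93}$$

where the convexity of $\Lambda^*$ results with a monotonicity in $\delta$ which in turn implies the above equality (and that the above limit exists). For the same reason $g(y)$ is a convex function and the set $\mathcal{E}_g = \{(y, \xi) : \xi \ge g(y)\} \subset \mathbb{R}^d \times \mathbb{R}$ is a convex set. Further, $g(\alpha y) = \alpha g(y)$ for all $\alpha \ge 0$ and in particular $g(0) = 0$. Observe that $g(y) = \infty$ when $x + \delta y \notin \mathcal{D}_{\Lambda^*}$ for all $\delta > 0$. Consider therefore those directions $y$ such that $g(y) < \infty$. Since $x \in \mathrm{ri}\,\mathcal{D}_{\Lambda^*}$, it then follows that for some $\epsilon > 0$ the whole line segment $x + \beta y$ for $|\beta| \le \epsilon$ is in $\mathcal{D}_{\Lambda^*}$. Let $\tilde y = \epsilon y$, so by the convexity of $\Lambda^*$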


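$$\Lambda^*(x) \le (1 - \delta)\Lambda^*(x + \delta\tilde y) + \delta\Lambda^*(x - (1 - \delta)\tilde y) < \infty \tag{2.3.94}$$

for all $\delta \in [0,1]$. This implies

$$g(\tilde y) \ge \lim_{\delta\downarrow 0}\big[\Lambda^*(x + \delta\tilde y) - \Lambda^*(x - (1 - \delta)\tilde y)\big] \ge -\Lambda^*(x - \tilde y) = -\Lambda^*(x - \epsilon y) > -\infty \tag{2.3.95}$$

where the last two inequalities follow by the non-negativity and upper semicontinuity of $\Lambda^*$, and the fact that $x - \epsilon y \in \mathcal{D}_{\Lambda^*}$. Since $g(y) = \epsilon^{-1}g(\tilde y)$, it follows from the above that $g(y) > -\infty$ for all $y$. Since $g(y) = |y|\,g(y/|y|)$ and $\inf_{\{z : |z| = 1\}} g(z) > -\infty$ it now follows that $\liminf_{y\to 0} g(y) \ge 0$, implying that $(0,-1) \notin \bar{\mathcal{E}}_g$. The set $\bar{\mathcal{E}}_g$ is closed, convex and non-empty (for example $(0,1) \in \mathcal{E}_g$). Thus, there exists a hyperplane in $\mathbb{R}^d \times \mathbb{R}$ which strictly separates the point $(0,-1)$ and the set $\bar{\mathcal{E}}_g$ (this is a particular instance of the Hahn-Banach theorem quoted in Appendix ??). Specifically, there exist $\lambda \in \mathbb{R}^d$ and $\rho \in \mathbb{R}$ such that

$$\langle\lambda, 0\rangle + \rho = \rho > \langle\lambda, y\rangle - \rho\xi \qquad \forall (y,\xi) \in \bar{\mathcal{E}}_g. \tag{2.3.96}$$

Considering $y = 0$ it is clear that $\rho > 0$ and then by specializing (2.3.96) to $\xi = g(y)$ one obtains that $g^*(\lambda/\rho) \le 1$ where $g^*$ is the Fenchel-Legendre transform of $g(y)$. Observe now that $g^*$ assumes only the values $0$ or $\infty$ as

$$g^*(\lambda) = \sup_y(\langle\lambda, y\rangle - g(y)) = \sup_{\alpha\ge 0}\alpha\Big[\sup_{\{z : |z| = 1\}}(\langle\lambda, z\rangle - g(z))\Big] = \begin{cases} 0 & \lambda \in C \\ \infty & \lambda \notin C \end{cases} \tag{2.3.97}$$

where $C = \{\lambda : |z| = 1 \Rightarrow g(z) \ge \langle\lambda, z\rangle\}$. Thus, the set $C$ must be non-empty or equivalently there exists $\lambda_0 \in \mathbb{R}^d$ such that $g(y) \ge \langle\lambda_0, y\rangle$ for all $y \in \mathbb{R}^d$.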

Considering now (2.3.93) one obtains

$$\Lambda^*(z) - \Lambda^*(x) \ge \langle\lambda_0, z - x\rangle$$

for all $z \in \mathbb{R}^d$. Therefore, also

$$\langle\lambda_0, x\rangle - \Lambda^*(x) = \sup_z(\langle\lambda_0, z\rangle - \Lambda^*(z)) \ge \Lambda(\lambda_0), \tag{2.3.98}$$

where the above inequality is a property of the Legendre transform of any convex, lower semicontinuous function $f$ such that $f(\cdot) > -\infty$ everywhere and $f(\cdot) \not\equiv \infty$ (consider Section ?? for a proof). Indeed, it is assumed in the statement of this lemma that $\Lambda(\cdot)$ is lower semicontinuous and the other conditions mentioned above are satisfied in view of part (a) of Lemma 2.3.1. The inequality (2.3.98) amounts to $x \in \partial\Lambda(\lambda_0)$ for some $\lambda_0 \in \mathbb{R}^d$.

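(b). Suppose there exists a point $\eta \in \mathcal{D}_\Lambda \setminus \mathcal{D}_\Lambda^o$ such that $\partial\Lambda(\eta)$ is non-empty. Then, $\partial\Lambda(\eta) = \{x : \Lambda_\eta^*(x) = 0\}$ is a non-empty closed convex set (since $\Lambda_\eta^*$ is a convex rate function). Suppose that an infinite line, say $\{z_0 + tz\}_{t\in\mathbb{R}}$, is contained in $\partial\Lambda(\eta)$. Then, for all $\theta \in \mathbb{R}^d$ and any $t \in \mathbb{R}$,

$$\Lambda(\theta) \ge \Lambda(\eta) + \langle\theta - \eta, z_0\rangle + t\langle\theta - \eta, z\rangle \tag{2.3.99}$$

which is possible only if $\mathcal{D}_\Lambda \subset \{\theta : \langle\theta, z\rangle = 0\}$, contradicting Assumption 2.3.1. From the above, it follows that $\partial\Lambda(\eta)$ is a convex closed set which does not contain infinite lines and as such by [24], 18.5.3, it contains an exposed point, i.e. there exists $x \in \partial\Lambda(\eta)$ and a vector $v \in \mathbb{R}^d$ such that

$$\langle v, x\rangle > \langle v, z\rangle \qquad \forall z \in \partial\Lambda(\eta),\ z \ne x. \tag{2.3.100}$$

Let the normal cone to $\mathcal{D}_\Lambda$ at $\eta$ be defined as

$$\mathcal{N} = \{n : \langle\lambda - \eta, n\rangle \le 0 \text{ for all } \lambda \in \mathcal{D}_\Lambda\} \tag{2.3.101}$$

and note that $\mathcal{N}$ is non-empty since $\mathcal{D}_\Lambda$ is a convex set of non-empty interior and $\eta \in \mathcal{D}_\Lambda \setminus \mathcal{D}_\Lambda^o$. Then, for any $n \in \mathcal{N}$ and any $\lambda \in \mathcal{D}_\Lambda$

$$\Lambda(\lambda) \ge \Lambda(\eta) + \langle\lambda - \eta, x\rangle \ge \Lambda(\eta) + \langle\lambda - \eta, x + n\rangle,$$

so that $x + n \in \partial\Lambda(\eta)$ where $x \in \partial\Lambda(\eta)$ is the exposed point defined above. Therefore, in particular, by (2.3.100)

$$\langle v, n\rangle \le 0 \qquad \forall n \in \mathcal{N}. \tag{2.3.102}$$

We next claim that $\eta + \delta v \in \mathcal{D}_\Lambda^o$ for all $\delta > 0$ small enough. Indeed, assume otherwise, then by [24], 23.7.1 there exists $n \in \mathcal{N}$ such that

$$\sup_{\lambda\in\mathcal{D}_\Lambda}\langle\lambda, n\rangle \le \langle\eta, n\rangle < \langle\eta + \delta v, n\rangle \tag{2.3.103}$$

contradicting the inequality (2.3.102) above.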

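Choose now a sequence $\lambda_n = \eta + \delta_n v \in \mathcal{D}_\Lambda^o$ such that $\delta_n \to 0$. Since $\nabla\Lambda(\lambda_n) \in \partial\Lambda(\lambda_n)$ for any $n$ it follows that

$$\Lambda(\theta) \ge \Lambda(\lambda_n) + \langle\theta - \lambda_n, \nabla\Lambda(\lambda_n)\rangle, \qquad \forall\theta \in \mathbb{R}^d. \tag{2.3.104}$$

Because of the assumption that $\Lambda$ is a steep function, $\epsilon_n = 1/|\nabla\Lambda(\lambda_n)| \to 0$ as $n \to \infty$ and $\epsilon_n\nabla\Lambda(\lambda_n)$ has a limit point $y \in \mathbb{R}^d$ with $|y| = 1$. Passing to the convergent subsequence $\{\lambda_n\}$, for any $n$ large enough and any $\theta \in \mathbb{R}^d$,

$$\Lambda(\theta) \ge (1 - \epsilon_n)\Lambda(\eta) + (1 - \epsilon_n)\langle\theta - \eta, x\rangle + \epsilon_n\Lambda(\lambda_n) + \langle\theta - \lambda_n, \epsilon_n\nabla\Lambda(\lambda_n)\rangle \tag{2.3.105}$$

where $x \in \partial\Lambda(\eta)$ is as specified in (2.3.100). In the limit $n \to \infty$, by the upper semicontinuity of the convex function $\Lambda(\cdot)$

$$\Lambda(\theta) \ge \Lambda(\eta) + \langle\theta - \eta, x + y\rangle \tag{2.3.106}$$

(as $\limsup_{n\to\infty}\Lambda(\lambda_n) \le \Lambda(\eta) < \infty$). Therefore, in particular

$$\langle v, y\rangle < 0. \tag{2.3.107}$$

By comparing (2.3.104) for $\theta = \eta$, $\lambda_n = \eta + \delta_n v$ with

$$\Lambda(\lambda_n) \ge \Lambda(\eta) + \langle\lambda_n - \eta, x\rangle = \Lambda(\eta) + \delta_n\langle v, x\rangle$$

one obtains

$$\langle v, x\rangle \le \langle v, \nabla\Lambda(\lambda_n)\rangle \tag{2.3.108}$$

for the same subsequence $\{\lambda_n\}$ as used above. Multiplying by $\epsilon_n > 0$ and taking the limit $n \to \infty$ yields $\langle v, y\rangle \ge 0$ in contradiction with (2.3.107). In conclusion, $\partial\Lambda(\eta) = \emptyset$ for any $\eta \in \mathcal{D}_\Lambda \setminus \mathcal{D}_\Lambda^o$.

□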


Exercises: 


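2.3.1 Prove that for any $\mu \in M_1(\Sigma)$, the relative entropy $H(\cdot|\mu)$ is the Fenchel-Legendre transform of the function $\Lambda(\cdot)$ defined in (2.3.82).

Hint: Prove first that $\mathcal{D}_{\Lambda^*} = M_1(\Sigma)$. Then, show that $\nu(a_i) = 0$ and $\nu \in M_1(\Sigma)$ imply that the value of $\Lambda^*(\nu)$ is obtained by taking $\lambda_i \to -\infty$. Finally, show that $\nu = \nabla\Lambda(\eta)$ when $\nu$ is a probability vector with $\Sigma_\nu = \Sigma$ and

$$\eta_i = \log\frac{\nu(a_i)}{\mu(a_i)}, \qquad i = 1, \ldots, |\Sigma|. \tag{2.3.109}$$

Conclude that then $\Lambda^*(\nu) = \langle\eta, \nu\rangle - \Lambda(\eta) = H(\nu|\mu)$.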


2.3.2 (a). Use the exponential form of Markov's inequality to prove that for any $C \subset \mathbb{R}^d$, any $n$ and any $\lambda \in \mathbb{R}^d$

$$a_n\log\mu_n(C) \le -\inf_{y\in C}\langle\lambda, y\rangle + a_n\Lambda_n(a_n^{-1}\lambda).$$

(b). Assume that $\Lambda(\lambda) = a_n\Lambda_n(a_n^{-1}\lambda)$ for any $n$ (examples where this is true are given in Theorem 2.3.2 and in exercise 2.3.5). Recall the following version of the min-max theorem: let $g(\theta, y)$ be convex and lower semicontinuous in $y$, concave and upper semicontinuous in $\theta$. Let $C \subset \mathbb{R}^d$ be convex and compact. Then

$$\inf_{y\in C}\sup_\theta g(\theta, y) = \sup_\theta\inf_{y\in C} g(\theta, y)$$

(c.f. [10], pg. 174). Apply this theorem to justify the upper bound

$$a_n\log\mu_n(C) \le -\sup_{\lambda\in\mathbb{R}^d}\inf_{y\in C}[\langle\lambda, y\rangle - \Lambda(\lambda)] = -\inf_{y\in C}\Lambda^*(y)$$

for any $n$ and any convex, compact set $C$.

(c). Establish (2.3.64) for all compact sets by applying the above bound. Note that this approach yields a concrete upper bound for any finite $n$.

(d). Find a compact set $B$ for which

$$\sup_{\lambda\in\mathbb{R}^d}\inf_{y\in B}[\langle\lambda, y\rangle - \Lambda(\lambda)] < \inf_{y\in B}\Lambda^*(y).$$

2.3.3  Prove  that  in  the  assumptions  of  Theorem  2.3.2,  the  assumption  of  lower  semicontinuity 
of  A  may  be  dropped  as  it  follows  from  the  steepness  and  i.i.d.  structure. 

Hint: You need to extend exercise 2.2.4 to $\mathbb{R}^d$. The only difficulty is with sequences $\lambda_n \to \lambda$ not
along  a  line,  but  such  that  their  angle  from  a  line  converges  to  zero.  Use  the  last  observation 
to  bound  the  difference  between  these  two  situations  in  terms  of  the  function  G  introduced  in 
exercise  2.2.4. 


2.3.4 Let $w_{t_1}, \ldots, w_{t_d}$ be samples of a Brownian motion at $t_1 < \cdots < t_d$, i.e. $(w_{t_{j+1}} - w_{t_j})$ is a Normal random variable independent of $w_{t_l}$, $l \le j$, of variance $(t_{j+1} - t_j)$ and zero mean. Find the rate function for the empirical mean $S_n$ of $X_i = (w^i_{t_1}, \ldots, w^i_{t_d})$ where $w^i_{t_j}$, $i = 1, \ldots, n$, are samples of independent Brownian motions at instances $t_j$.

Remark: Note that the law of $S_n$ is the same as that of $\frac{1}{\sqrt{n}}(w_{t_1}, \ldots, w_{t_d})$ and compare to Schilder's Theorem which is presented in Section ??.


2.3.5 Let $X_j$ be i.i.d. random variables over $\mathbb{R}^d$ with a steep logarithmic moment generating function $\Lambda$ such that $0 \in \mathcal{D}_\Lambda^o$. Let $N(t)$ be a Poisson process of unit rate which is independent of the $X_j$ variables and consider the random variables

$$S_n = \frac{1}{n}\sum_{j=1}^{N(n)} X_j.$$

Prove that the family of laws corresponding to $S_n$ satisfies a large deviations principle with the rate function being the Fenchel-Legendre transform of $e^{\Lambda(\lambda)} - 1$.

Hint: You can apply Theorem 2.3.2 as $N(n) = \sum_{j=1}^n N_j$ where $N_j$ are i.i.d. Poisson(1) random variables.


2.3.6 Let $N(n)$ be a sequence of integer valued random variables whose logarithmic moment generating functions $\Lambda_n$ satisfy Assumption 2.3.1. Let $X_j$ be i.i.d. random variables over $\mathbb{R}^d$ with finite everywhere logarithmic moment generating function $\Lambda_X$ and let $\mu_n$ denote the law of

$$Z_n = a_n\sum_{j=1}^{N(n)} X_j.$$

Prove that if the conditions of Lemma 2.3.2 hold for the convex function $\Lambda(\Lambda_X(\lambda))$ then $\mu_n$ satisfies a large deviations principle governed by the Fenchel-Legendre transform of this function.


2.3.7 For any $\delta > 0$ let $Z_{n,\delta} = Z_n + \sqrt{\delta a_n}\,V$ where $V$ is a standard multivariate Normal random variable.

(a). Prove that when Assumption 2.3.1 holds for $Z_n$ it also holds for $Z_{n,\delta}$ with the limiting logarithmic moment generating function $\Lambda_\delta(\lambda) = \Lambda(\lambda) + \frac{\delta}{2}|\lambda|^2$.

(b). Show that for any $x \in \mathbb{R}^d$ the value of the Fenchel-Legendre transform of $\Lambda_\delta$ does not exceed $\Lambda^*(x)$.

(c). Prove that if $\lim_{n\to\infty} E(Z_n) = \bar x \in \mathbb{R}^d$ exists then $\Lambda(\lambda) \ge \langle\lambda, \bar x\rangle$ for any $\lambda \in \mathbb{R}^d$. Conclude that if in addition $\Lambda$ is finite and differentiable everywhere then $\mathcal{F}_\delta = \mathbb{R}^d$ (where $\mathcal{F}_\delta = \{x : x = \nabla\Lambda_\delta(\lambda) \text{ for some } \lambda \in \mathbb{R}^d\}$).

(d). By applying Theorem 2.3.1 for $Z_{n,\delta}$ and (a)-(c) above deduce that for any $x \in \mathbb{R}^d$ and any $\epsilon > 0$

$$\liminf_{n\to\infty} a_n\log\mathrm{Prob}(Z_{n,\delta} \in B_{x,\epsilon/2}) \ge -\Lambda^*(x). \tag{2.3.110}$$

(e). Prove that

$$\limsup_{n\to\infty} a_n\log\mathrm{Prob}(\sqrt{\delta a_n}\,|V| \ge \epsilon/2) \le -\frac{\epsilon^2}{8\delta}. \tag{2.3.111}$$

(f). Prove that

$$\mathrm{Prob}(Z_n \in B_{x,\epsilon}) \ge \mathrm{Prob}(Z_{n,\delta} \in B_{x,\epsilon/2}) - \mathrm{Prob}(\sqrt{\delta a_n}\,|V| \ge \epsilon/2) \tag{2.3.112}$$

and by combining (2.3.110), (2.3.111) and (2.3.112) (for $n \to \infty$ and then $\delta \to 0$) conclude that the large deviations lower bound holds for the laws corresponding to $Z_n$.

(g). Deduce now by part (a) of Theorem 2.3.1 that when Assumption 2.3.1 holds with $\Lambda$ which is finite and differentiable everywhere and when moreover $\lim_{n\to\infty} E(Z_n)$ exists then $\mu_n$ satisfy a large deviations principle with rate function $\Lambda^*$.

Remark: This may serve for example as an alternative derivation of Cramér's theorem which avoids the convex analysis Lemma 2.3.2.


2.3.8 Let $X_1, \ldots, X_n, \ldots$ be a real-valued, zero mean, stationary Gaussian process with covariance sequence $R_i = E(X_n X_{n+i})$. Suppose the process has a finite power $P$ defined via $P = \lim_{n\to\infty}\sum_{i=-n}^{n} R_i(1 - \frac{|i|}{n})$. Let $\mu_n$ be the law of the empirical mean $S_n$ of the first $n$ samples of this process. Prove that $\mu_n$ satisfy a large deviations principle controlled by the good rate function $\Lambda^*(x) = \frac{x^2}{2P}$.


2.3.9 Again, let $X_1, \ldots, X_n, \ldots$ be a real-valued, zero mean, stationary Gaussian process with covariance sequence $R_i = E(X_n X_{n+i})$. Assume that this covariance sequence is absolutely summable and let $S(\omega) \ge 0$ denote its Fourier transform. Consider the empirical covariances

$$Z_n^j = \frac{1}{n}\sum_{i=1}^{n-j} X_i X_{i+j}$$

for $j = 0, \ldots, d - 1$. Let $Z_n \in \mathbb{R}^d$ be the empirical covariance vector composed of $\{Z_n^j\}$ and $\Lambda_n$ the corresponding logarithmic moment generating functions.

(a).  Verify  that 

Anind)  = -  XiiOK)]  . 


1=1 


Here $\lambda_i(\Theta R)$ is the $i$-th eigenvalue of the product of the covariance matrix $R$ and the matrix $\Theta$ where $\Theta(j, k) = \theta_{j-k}$ for all $\{k \le j \le k + d - 1\}$ and is zero otherwise.

(b).  Assume  that 


Um  -if^log[l  -  Ai(©R)]  =  riog[l  -  S{u)''£eke-^'^’‘]duj  ^  A{9) 

n-*oo  In  r-f  JQ 

1=1  *=0 

(this identity indeed holds by the limiting distribution of near Toeplitz matrices, see [16]). Prove that the empirical covariance vectors $Z_n$ satisfy a large deviations principle controlled by the Fenchel-Legendre transform of the function $\Lambda$ defined above.
2.4 Large deviations of Markov chains over finite alphabets


The results of Section 2.1 are extended in this section to random variables $X_1, \ldots, X_n$ which take values in the finite alphabet $\Sigma = \{a_1, \ldots, a_{|\Sigma|}\}$, with a Markov structure instead of an i.i.d. structure. Although most of the results may be derived by the method of types presented in Section 2.1, the combinatorics involved are quite cumbersome (for more details about this approach consider exercise 2.4.7). Thus, an alternative derivation of these results via an application of Theorem 2.3.1 is adopted here. Without loss of generality, identify $\Sigma$ with the set $\{1, \ldots, |\Sigma|\}$ so that $a_i = i$.

Let $\Pi = \{\pi(i,j)\}_{i,j=1}^{|\Sigma|}$ be a stochastic matrix, i.e. a matrix whose elements are non-negative and such that each row-sum is one. $P_x^\pi$ denotes the Markov probability measure associated with the transition probability $\Pi$ and initial state $x \in \Sigma$. Specifically,

$$P_x^\pi(X_1 = x_1, \ldots, X_n = x_n) = \pi(x, x_1)\prod_{i=1}^{n-1}\pi(x_i, x_{i+1}). \tag{2.4.113}$$

A matrix $B$ with nonnegative entries is called irreducible, if for any pair of indices $i, j$ there exists an $m(i,j)$ such that $B^{m(i,j)}(i,j) > 0$, where $B^m$ denotes the usual product of matrices.

This property is equivalent to the condition that one may find for each $i, j$ a sequence of indices $i_1, \ldots, i_m$ such that $i_1 = i$, $i_m = j$ and $B(i_k, i_{k+1}) > 0$ for all $k = 1, \ldots, m - 1$. The following theorem describes basic properties of irreducible matrices.

Theorem 2.4.1 (Perron-Frobenius) Let $B = \{B(i,j)\}_{i,j=1}^{|\Sigma|}$ be an irreducible matrix. Then there exists an eigenvalue $\rho$ (called the Perron-Frobenius eigenvalue) such that

(a). $\rho > 0$ is real.

(b). There exist right and left eigenvectors corresponding to the eigenvalue $\rho$ which are strictly positive, i.e. there exist vectors $\mu, \vartheta$ with $\mu_i, \vartheta_i > 0$ for all $i$ (this is denoted in the sequel by $\mu \gg 0$, $\vartheta \gg 0$) such that

$$\sum_{j=1}^{|\Sigma|} B(i,j)\vartheta_j = \rho\vartheta_i \tag{2.4.114}$$

$$\sum_{i=1}^{|\Sigma|} \mu_i B(i,j) = \rho\mu_j \tag{2.4.115}$$

(c). For any eigenvalue $\lambda$ of $B$, $|\lambda| \le \rho$.

(d). The right and left eigenvectors $\mu, \vartheta$ corresponding to the eigenvalue $\rho$ are unique up to a constant multiple.

(e). Let $\phi$ be any strictly positive vector, then for any $i \in \Sigma$

$$\lim_{n\to\infty}\frac{1}{n}\log\Big[\sum_{j=1}^{|\Sigma|} B^n(i,j)\phi_j\Big] = \lim_{n\to\infty}\frac{1}{n}\log\Big[\sum_{j=1}^{|\Sigma|}\phi_j B^n(j,i)\Big] = \log\rho. \tag{2.4.116}$$

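Proof: Parts (a)-(d) are stated for example in [27], Theorem 1.5. To prove part (e), let $\alpha = \sup_i\vartheta_i$, $\beta = \inf_i\vartheta_i > 0$ and $\gamma = \inf_j\phi_j > 0$, $\delta = \sup_j\phi_j$ (where $\vartheta$ is the right eigenvector corresponding to $\rho$ above). Then, for all $i, j \in \Sigma$,

$$\frac{\gamma}{\alpha}B^n(i,j)\vartheta_j \le B^n(i,j)\phi_j \le \frac{\delta}{\beta}B^n(i,j)\vartheta_j. \tag{2.4.117}$$

Therefore,

$$\lim_{n\to\infty}\frac{1}{n}\log\Big[\sum_{j=1}^{|\Sigma|} B^n(i,j)\phi_j\Big] = \lim_{n\to\infty}\frac{1}{n}\log\Big[\sum_{j=1}^{|\Sigma|} B^n(i,j)\vartheta_j\Big] = \lim_{n\to\infty}\frac{1}{n}\log(\rho^n\vartheta_i) = \log\rho. \tag{2.4.118}$$

A similar argument leads to

$$\lim_{n\to\infty}\frac{1}{n}\log\Big[\sum_{j=1}^{|\Sigma|}\phi_j B^n(j,i)\Big] = \log\rho. \tag{2.4.119}$$

□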
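Part (e) of the Perron-Frobenius theorem lends itself to a quick numerical check. The sketch below uses an assumed $2 \times 2$ irreducible matrix with eigenvalues $3$ and $-2$ and an arbitrary strictly positive vector $\phi$ (both hypothetical, chosen so that $\log\rho = \log 3$ is known in closed form). It evaluates $\frac{1}{n}\log\sum_j B^n(i,j)\phi_j$ by repeated multiplication with rescaling.

```python
import math

def growth_rate(B, phi, i, n):
    """(1/n) * log( sum_j B^n(i, j) * phi_j ), with rescaling to avoid overflow."""
    size = len(B)
    v = list(phi)
    log_scale = 0.0
    for _ in range(n):
        # one multiplication by B, followed by normalization of the result
        v = [sum(B[r][c] * v[c] for c in range(size)) for r in range(size)]
        s = max(v)
        v = [x / s for x in v]
        log_scale += math.log(s)
    return (log_scale + math.log(v[i])) / n

if __name__ == "__main__":
    B = [[1.0, 2.0], [3.0, 0.0]]  # irreducible; eigenvalues 3 and -2
    phi = [1.0, 5.0]              # any strictly positive vector
    for n in (10, 100, 1000):
        print(n, growth_rate(B, phi, 0, n), growth_rate(B, phi, 1, n), math.log(3.0))
```

The limit does not depend on the row index $i$ or on $\phi$; the error decays like $1/n$, reflecting the constant factors that appear in (2.4.117).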


2.4.1 Cramér's theorem for Markov additive processes

Let $g : \Sigma \to \mathbb{R}^d$ be a given deterministic function. The large deviations of the empirical $g$-mean

$$Z_n = \frac{1}{n}\sum_{k=1}^n g(X_k) \tag{2.4.120}$$

are the subject of this section (for the extension to random functions see exercise 2.4.1). If the $X_k$ were independent, Cramér's Theorem 2.3.2 would apply and the large deviations principle would be governed by the Legendre transform of the logarithmic moment generating function. Theorem 2.3.1 hints that the rate function may still be expressed in terms of a Legendre transform even in the current dependent case where the $X_k$ possess a Markov structure.

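In order to find the proper function which replaces the logarithmic moment generating function $\Lambda(\lambda)$, associate with any $\lambda \in \mathbb{R}^d$ a non-negative matrix $\Pi_\lambda$ via

$$\pi_\lambda(i,j) = \pi(i,j)\,e^{\langle\lambda, g(j)\rangle}, \qquad i, j \in \Sigma. \tag{2.4.121}$$

When $\Pi$ is irreducible it follows that $\Pi_\lambda$ are also irreducible matrices since $e^{\langle\lambda, g(j)\rangle}$ is positive.

Let $\rho(\Pi_\lambda)$ denote the Perron-Frobenius eigenvalue of $\Pi_\lambda$; then $\log\rho(\Pi_\lambda)$ plays the role of the logarithmic moment generating function $\Lambda(\lambda)$. Specifically, the following analog of Theorem 2.3.2 holds.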


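Theorem 2.4.2 Assume $\Pi$ is irreducible and define

$$I(z) = \sup_{\lambda\in\mathbb{R}^d}\{\langle\lambda, z\rangle - \log\rho(\Pi_\lambda)\}. \tag{2.4.122}$$

Then, $I(\cdot)$ is a good, convex rate function which controls the large deviations of the empirical $g$-means $\{Z_n\}$, i.e. for any measurable set $\Gamma \subset \mathbb{R}^d$ and any initial state $x \in \Sigma$,

$$-\inf_{z\in\Gamma^o} I(z) \le \liminf_{n\to\infty}\frac{1}{n}\log P_x^\pi(Z_n \in \Gamma) \le \limsup_{n\to\infty}\frac{1}{n}\log P_x^\pi(Z_n \in \Gamma) \le -\inf_{z\in\bar\Gamma} I(z). \tag{2.4.123}$$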

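Proof: Define

$$\Lambda_n(\lambda) = \log E_x^\pi\big[e^{\langle\lambda, Z_n\rangle}\big].$$

In view of Theorem 2.3.1, it is enough to check that the limit

$$\Lambda(\lambda) = \lim_{n\to\infty}\frac{1}{n}\Lambda_n(n\lambda) = \lim_{n\to\infty}\frac{1}{n}\log E_x^\pi\big[e^{n\langle\lambda, Z_n\rangle}\big]$$

exists for all $\lambda \in \mathbb{R}^d$, is differentiable, and that $\Lambda(\lambda) = \log\rho(\Pi_\lambda)$. Note that

$$\Lambda_n(n\lambda) = \log E_x^\pi\Big[e^{\langle\lambda, \sum_{k=1}^n g(X_k)\rangle}\Big]$$
$$= \log\sum_{x_1,\ldots,x_n} P_x^\pi(X_1 = x_1, \ldots, X_n = x_n)\prod_{k=1}^n e^{\langle\lambda, g(x_k)\rangle}$$
$$= \log\sum_{x_1,\ldots,x_n}\pi(x, x_1)e^{\langle\lambda, g(x_1)\rangle}\cdots\pi(x_{n-1}, x_n)e^{\langle\lambda, g(x_n)\rangle}$$
$$= \log\sum_{j=1}^{|\Sigma|}(\Pi_\lambda)^n(x, j). \tag{2.4.124-2.4.126}$$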
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Since  IIa  is  an  irreducible  matrix,  part  (e)  of  the  Perron- Frobenius  theorem  yields  (for  =  1) 

A(A)  =  lim  -  An(nA)  =  log  p(nA)  (2.4.127) 

Moreover,  since  1S|  is  finite  it  is  clear  that  p(n>),  being  an  isolated  root  of  the  characteristic 
equation  for  the  matrix  IIx  is  differentiable  with  respect  to  A  (see  [14]  for  details).  Therefore, 
Theorem  2.3.1  may  be  applied  to  complete  the  proof.  □ 
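The limit (2.4.127) can be checked numerically on a small example. The following is only an illustrative sketch: the two-state chain, the scalar function $g$ (so $d = 1$), and the helper names `tilted_matrix`, `log_perron_root` and `normalized_log_row_sum` are assumptions made for this example and do not come from the text. It compares $\frac{1}{n} \log \sum_j (\Pi_\lambda)^n(x,j)$ with $\log \rho(\Pi_\lambda)$, the latter obtained by power iteration (which is adequate here because the example matrix is strictly positive).

```python
import math

def tilted_matrix(P, g, lam):
    """pi_lambda(i, j) = pi(i, j) * exp(lam * g[j]), for scalar-valued g (d = 1)."""
    n = len(P)
    return [[P[i][j] * math.exp(lam * g[j]) for j in range(n)] for i in range(n)]

def log_perron_root(M, iters=2000):
    """log of the Perron-Frobenius eigenvalue of a strictly positive matrix M,
    approximated by power iteration started from the all-ones vector."""
    n = len(M)
    v = [1.0] * n
    log_rho = 0.0
    for _ in range(iters):
        w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        s = sum(w)
        log_rho = math.log(s / sum(v))   # growth factor of the total mass
        v = [x / s for x in w]           # renormalize to avoid overflow
    return log_rho

def normalized_log_row_sum(M, x, n):
    """(1/n) * log sum_j (M^n)(x, j), accumulated in log scale."""
    size = len(M)
    row = [1.0 if j == x else 0.0 for j in range(size)]
    log_scale = 0.0
    for _ in range(n):
        row = [sum(row[i] * M[i][j] for i in range(size)) for j in range(size)]
        s = sum(row)
        log_scale += math.log(s)
        row = [r / s for r in row]
    return log_scale / n

# Hypothetical example data: a strictly positive stochastic matrix and g = (0, 1).
P = [[0.9, 0.1], [0.4, 0.6]]
G = [0.0, 1.0]

if __name__ == "__main__":
    M = tilted_matrix(P, G, 0.7)
    print(normalized_log_row_sum(M, 0, 2000), log_perron_root(M))
```

For $\lambda = 0$ the tilted matrix is $\Pi$ itself, so the computed value of $\log \rho$ should vanish, which gives a quick sanity check of the helpers.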

Remark: The above proof relies on two properties of the Markov chain, namely, part (e) of the Perron-Frobenius theorem and the differentiability of $\rho(\Pi_\lambda)$ with respect to $\lambda$. Thus, Theorem 2.4.2 holds as long as the Markov chain has these two properties. In particular, the finiteness of $\Sigma$ is not crucial and indeed a large deviations principle for a general Markov chain set-up is presented in Section ??.

Exercises: 


2.4.1 Assume that $X_1, \dots, X_n$ have the joint law $P_x^{\pi}$ where $\Pi$ is an irreducible stochastic matrix. Consider the empirical means

$$Z_n = \frac{1}{n} \sum_{k=1}^{n} Y_k$$

where the conditional law of $Y_k$ when $X_k = j$ is $\mu_j \in M_1(\mathbb{R}^d)$ and for any given realization $\{X_k\}_{k=1}^{n}$ of the Markov chain states the variables $Y_k$ are conditionally independent. Suppose that the logarithmic moment generating functions $\Lambda_j$ associated with $\mu_j$ are finite everywhere (for all $j \in \Sigma$). Prove that Theorem 2.4.2 holds in this case where now

$$\pi_\lambda(i,j) = \pi(i,j)\, e^{\Lambda_j(\lambda)}, \qquad i,j \in \Sigma.$$


2.4.2  Sanov’s  theorem  for  the  empirical  measure  of  Markov  chains 

A particularly important application of Theorem 2.4.2 above yields the large deviations principle satisfied by the empirical measure of Markov chains. Namely, define $L_n^X(i) = \frac{1}{n} \sum_{j=1}^{n} \delta_i(X_j)$, where

$$\delta_i(x) = \begin{cases} 1 & i = x \\ 0 & \text{otherwise.} \end{cases}$$

For $i = 1, \dots, |\Sigma|$, let $a_n(i) = E_x[L_n^X(i)] = \frac{1}{n} \sum_{j=1}^{n} \Pi^j(x,i)$. Then, $a_n \to \mu$ as $n \to \infty$ where $\mu$ is a unique properly normalized left eigenvector of $\Pi$ (since $|a_n(\Pi - I)(j)| = \frac{1}{n} |\Pi^{n+1}(x,j) - \pi(x,j)| \le \frac{1}{n}$).

Further, by Chebychev's inequality, $L_n^X \to \mu$ as $n \to \infty$ and therefore, $L_n^X$ is a good candidate for a large deviations statement on $M_1(\Sigma)$.

It is clear that $L_n^X$ fits into the framework of Section 2.4.1 with $g(x) = (\delta_1(x), \dots, \delta_{|\Sigma|}(x))$.

Therefore, by Theorem 2.4.2, a large deviations principle follows with the rate function

$$I(q) = \sup_{\lambda \in \mathbb{R}^{|\Sigma|}} \left( \langle \lambda, q \rangle - \log \rho(\Pi_\lambda) \right), \qquad (2.4.128)$$

where here $\pi_\lambda(i,j) = \pi(i,j)\, e^{\lambda_j}$ and $q \in M_1(\Sigma)$. The following alternative characterization of $I(q)$ is sometimes more useful.


Theorem  2.4.3 


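$$I(q) = J(q) \triangleq \begin{cases} \sup_{u \gg 0} \sum_{j=1}^{|\Sigma|} q_j \log \dfrac{u_j}{(u\Pi)_j} & q \in M_1(\Sigma) \\ \infty & q \notin M_1(\Sigma) \end{cases} \qquad (2.4.129)$$

Remarks: This identity actually holds also for non-stochastic matrices (see exercise 2.4.3). In the i.i.d. set up the rows of $\Pi$ are identical and then $J(q)$ is merely the relative entropy $H(q \,|\, \pi(1, \cdot))$ (see exercise 2.4.2).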
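The i.i.d. remark above can be illustrated numerically. The sketch below is an assumption-laden illustration, not part of the original notes: the three-letter law `mu`, the test point `q`, and the helper names `j_objective` and `relative_entropy` are hypothetical. When every row of $\Pi$ equals $\mu$, the objective $\sum_j q_j \log (u_j / (u\Pi)_j)$ never exceeds $H(q|\mu)$ and attains it at $u = q$.

```python
import math
import random

def j_objective(q, u, P):
    """sum_j q_j * log(u_j / (uP)_j) for a strictly positive row vector u."""
    n = len(q)
    uP = [sum(u[i] * P[i][j] for i in range(n)) for j in range(n)]
    return sum(q[j] * math.log(u[j] / uP[j]) for j in range(n) if q[j] > 0)

def relative_entropy(q, mu):
    """H(q | mu) = sum_j q_j log(q_j / mu_j), with the convention 0 log 0 = 0."""
    return sum(qj * math.log(qj / mj) for qj, mj in zip(q, mu) if qj > 0)

if __name__ == "__main__":
    random.seed(0)
    mu = [0.5, 0.3, 0.2]              # hypothetical common row of Pi
    P = [mu[:] for _ in range(3)]     # i.i.d. case: identical rows
    q = [0.2, 0.2, 0.6]
    best = max(j_objective(q, [random.uniform(0.01, 5.0) for _ in range(3)], P)
               for _ in range(10000))
    # Random search stays below H(q|mu); the choice u = q attains it.
    print(best, j_objective(q, q, P), relative_entropy(q, mu))
```

A random search over $u$ is used only to make the inequality visible; it is not an efficient way to evaluate $J(q)$ for a general matrix.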

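Proof: Since $M_1(\Sigma)$ is a closed subset of $\mathbb{R}^{|\Sigma|}$ its complement $M_1(\Sigma)^c$ is an open set. Further, $L_n^X \in M_1(\Sigma)$, for any $n$ and any realization $X$. Therefore, by the large deviations lower bound (2.4.123)

$$-\infty = \liminf_{n\to\infty} \frac{1}{n} \log P_x^{\pi}(\{L_n^X \in M_1(\Sigma)^c\}) \ge -\inf_{q \in M_1(\Sigma)^c} I(q), \qquad (2.4.130)$$

i.e., $I(q) = \infty$ for any $q \notin M_1(\Sigma)$.

Fix $q \in M_1(\Sigma)$, $u \gg 0$ and set $\lambda_j = \log \frac{u_j}{(u\Pi)_j}$ (since $u \gg 0$ and $\Pi$ is irreducible, it follows that $u\Pi \gg 0$). Observe that $u \Pi_\lambda = u$ and thus $\rho(\Pi_\lambda) = 1$ by part (e) of the Perron-Frobenius theorem (with $\phi_i = u_i > 0$). Therefore,

$$I(q) \ge \sum_{j=1}^{|\Sigma|} q_j \log \frac{u_j}{(u\Pi)_j} - \log 1.$$

Since $u \gg 0$ is arbitrary this inequality implies that $I(q) \ge J(q)$.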


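To establish the reverse inequality, fix an arbitrary vector $\lambda \in \mathbb{R}^{|\Sigma|}$ and let $\alpha > \rho(\Pi_\lambda)$ be any upper bound on $\rho(\Pi_\lambda)$. Define

$$u^*_i = \sum_{n=0}^{\infty} \alpha^{-n} \sum_{j=1}^{|\Sigma|} (\Pi_\lambda)^n (j,i) < \infty, \qquad i = 1, \dots, |\Sigma|, \qquad (2.4.131)$$

where the finiteness of $u^*_i$ is a direct consequence of part (e) of the Perron-Frobenius theorem. Moreover, $u^* \Pi_\lambda = \alpha (u^* - \mathbf{1})$ (where $\mathbf{1}$ denotes the all ones vector) and $u^* \gg 0$. Thus, by the definition of $\Pi_\lambda$,

$$\langle \lambda, q \rangle + \sum_{j=1}^{|\Sigma|} q_j \log \frac{(u^*\Pi)_j}{u^*_j} = \sum_{j=1}^{|\Sigma|} q_j \log \frac{(u^*\Pi_\lambda)_j}{u^*_j} = \sum_{j=1}^{|\Sigma|} q_j \log \frac{\alpha (u^* - \mathbf{1})_j}{u^*_j} = \log \alpha + \sum_{j=1}^{|\Sigma|} q_j \log \left( 1 - \frac{1}{u^*_j} \right) \le \log \alpha. \qquad (2.4.132)$$

Thus, $\langle \lambda, q \rangle - \log \alpha \le \sum_{j=1}^{|\Sigma|} q_j \log \frac{u^*_j}{(u^*\Pi)_j}$. Take $\alpha \downarrow \rho(\Pi_\lambda)$ to deduce that

$$\langle \lambda, q \rangle - \log \rho(\Pi_\lambda) \le \sup_{u \gg 0} \sum_{j=1}^{|\Sigma|} q_j \log \frac{u_j}{(u\Pi)_j} = J(q) \qquad (2.4.133)$$

and since $\lambda$ is arbitrary, also $I(q) \le J(q)$ and the proof is complete. □

Exercises: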


2.4.2 Suppose $\pi(i,j) = \mu(j)$, $i,j \in \Sigma$ where $\mu \in M_1(\Sigma)$. Prove that then $J(\cdot) = H(\cdot|\mu)$ (the relative entropy with respect to $\mu$) while $I(\cdot)$ is the Fenchel-Legendre transform of the logarithmic moment generating function $\Lambda$ of (2.3.82). Thus, Theorem 2.4.3 is the natural extension of exercise 2.3.1 to the Markov set-up.

2.4.3 (a). Show that (2.4.129) holds for any non-negative irreducible matrix $\Pi$ (not necessarily stochastic).

Hint: Let $\phi(i) = \sum_j \pi(i,j)$. Clearly, $\phi \gg 0$, and thus the matrix $\Pi^*$ given by $\pi^*(i,j) = \pi(i,j)/\phi(i)$ is stochastic. Prove now that $J_{\Pi^*}(q) = J_\Pi(q) + \sum_j q_j \log \phi(j)$ for any $q \in \mathbb{R}^{|\Sigma|}$ and likewise $I_{\Pi^*}(q) = I_\Pi(q) + \sum_j q_j \log \phi(j)$ (where $J_\Pi$ and $I_\Pi$ denote the rate functions $J$ and $I$ associated with the matrix $\Pi$ via (2.4.129) and (2.4.128) respectively).

(b). Show that for any irreducible, non-negative matrix $\Pi$

$$\log \rho(\Pi) = \sup_{\nu \in M_1(\Sigma)} \{ -I(\nu) \}. \qquad (2.4.134)$$

This characterization of the spectral radius of non-negative matrices is useful when looking for tight bounds (for an alternative characterization see exercise 2.4.5).


2.4.3  Sanov’s  theorem  for  the  pair  empirical  measure  of  Markov  chains 

The large deviations principle for the empirical measure of a Markov chain is still in the form of an optimization problem. Moreover the nice interpretation in terms of entropy (recall Section 2.1.1 where the i.i.d. case is presented) has disappeared. It is interesting to note that by considering a somewhat different random variable, from which the large deviations for $L_n^X$ may be recovered (see exercise 2.4.4), one is also able to get a large deviations principle with a rate function which is an appropriate relative entropy.

Consider the space $\Sigma^{(2)} = \Sigma \times \Sigma$, which corresponds to consecutive pairs of elements from the sequence $X$. Note that by considering the pairs formed by $X_1, \dots, X_n$, i.e. the sequence $X_1X_2, X_2X_3, \dots, X_iX_{i+1}, \dots, X_{n-1}X_n$, one recovers a Markov chain with state space $\Sigma^{(2)}$ and transition matrix $\Pi^{(2)}$ specified via

$$\pi^{(2)}(k \times l,\; i \times j) = \delta_l(i)\, \pi(i,j). \qquad (2.4.135)$$

For simplicity assume throughout this section that $\Pi$ is strictly positive (i.e. $\pi(i,j) > 0$ for all $i,j$). Then, $\Pi^{(2)}$ is an irreducible transition matrix, and therefore the results of Section 2.4.2 may be applied to find the large deviations rate function $I^{(2)}(q)$ associated with the pair empirical measures

$$L_n^{X,(2)}(y) = \frac{1}{n} \sum_{i=1}^{n} \delta_y(X_{i-1}, X_i), \qquad y \in \Sigma^{(2)}. \qquad (2.4.136)$$

Note that $L_n^{X,(2)} \in M_1(\Sigma^{(2)})$ and therefore $I^{(2)}(\cdot)$ is a good, convex, rate function over this space.

The next theorem characterizes $I^{(2)}(\cdot)$ as an appropriate relative entropy. The following definitions are needed for that purpose. For any $q \in M_1(\Sigma^{(2)})$, let $q(i) = \sum_{j=1}^{|\Sigma|} q(i,j)$ be its marginal and when $q(i) > 0$ let $q(j|i) = q(i,j)/q(i)$. A measure $q \in M_1(\Sigma^{(2)})$ is called shift invariant if $q(i) = \sum_{j=1}^{|\Sigma|} q(j,i)$ for all $i$ (i.e., both marginals of $q$ are identical).


Theorem 2.4.4 Assume $\Pi$ is strictly positive. Then for any $q \in M_1(\Sigma^{(2)})$,

$$I^{(2)}(q) = \begin{cases} \sum_{i=1}^{|\Sigma|} q(i)\, H(q(\cdot|i) \,|\, \pi(i,\cdot)) & q \text{ shift invariant} \\ \infty & \text{otherwise} \end{cases} \qquad (2.4.137)$$

where $H(\cdot|\cdot)$ is the relative entropy function defined in Section 2.1.1. Specifically,

$$H(q(\cdot|i) \,|\, \pi(i,\cdot)) = \sum_{j=1}^{|\Sigma|} q(j|i) \log \frac{q(j|i)}{\pi(i,j)}. \qquad (2.4.138)$$

Remarks: When $\Pi$ is not strictly positive (but is irreducible) the theorem still applies with $\Sigma^{(2)}$ replaced by $\Sigma_\Pi = \{(i,j) : \pi(i,j) > 0\}$, and an almost identical proof. The above representation of $I^{(2)}(q)$ is useful for example in characterizing the spectral radius of non-negative matrices (see exercise 2.4.5) and in establishing the analog of Sanov's theorem for time weighted empirical measures (see exercise 2.4.6). It is also useful because bounds on the relative entropy are readily available and may be used to obtain bounds on the rate function.
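The formula (2.4.137) is straightforward to evaluate. The following minimal sketch assumes a strictly positive transition matrix given as nested lists; the function names `is_shift_invariant` and `pair_rate` and the numerical tolerance are choices made for this illustration only. It returns infinity when the two marginals of $q$ differ and otherwise sums $q(i,j) \log \frac{q(i,j)}{q(i)\pi(i,j)}$, which equals $\sum_i q(i) H(q(\cdot|i)|\pi(i,\cdot))$.

```python
import math

def is_shift_invariant(q, tol=1e-12):
    """True when the row marginal and the column marginal of q coincide."""
    n = len(q)
    return all(
        abs(sum(q[i][j] for j in range(n)) - sum(q[j][i] for j in range(n))) <= tol
        for i in range(n)
    )

def pair_rate(q, P):
    """I2(q) = sum_i q(i) H(q(.|i) | pi(i,.)) if q is shift invariant, else inf.
    Uses the convention 0 log 0 = 0 and assumes P is strictly positive."""
    if not is_shift_invariant(q):
        return math.inf
    n = len(q)
    total = 0.0
    for i in range(n):
        qi = sum(q[i])                      # marginal q(i)
        for j in range(n):
            if q[i][j] > 0.0:
                total += q[i][j] * math.log(q[i][j] / (qi * P[i][j]))
    return total

if __name__ == "__main__":
    P = [[0.9, 0.1], [0.4, 0.6]]            # hypothetical example chain
    mu = [0.8, 0.2]                         # its stationary law: mu P = mu
    q_typical = [[mu[i] * P[i][j] for j in range(2)] for i in range(2)]
    print(pair_rate(q_typical, P))                       # approximately 0
    print(pair_rate([[0.25, 0.25], [0.25, 0.25]], P))    # strictly positive
    print(pair_rate([[0.5, 0.5], [0.0, 0.0]], P))        # inf: not shift invariant
```

The first call illustrates that the rate vanishes at $q(i,j) = \mu(i)\pi(i,j)$, the pair measure of the stationary chain.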


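Proof: By Theorem 2.4.3

$$I^{(2)}(q) = \sup_{u \gg 0} \sum_{j=1}^{|\Sigma|} \sum_{i=1}^{|\Sigma|} q(i,j) \log \frac{u(i,j)}{(u\Pi^{(2)})(i,j)} = \sup_{u \gg 0} \sum_{j=1}^{|\Sigma|} \sum_{i=1}^{|\Sigma|} q(i,j) \log \frac{u(i,j)}{\left[ \sum_k u(k,i) \right] \pi(i,j)} \qquad (2.4.139)$$

where the last equality follows by (2.4.135).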


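Assume first that $q$ is not shift invariant. Then, $q(j_0) < \sum_k q(k,j_0)$ for some $j_0$. For $u$ such that $u(\cdot,j) = 1$ when $j \ne j_0$ and $u(\cdot,j_0) = e^{\alpha}$,

$$\sum_{j=1}^{|\Sigma|} \sum_{i=1}^{|\Sigma|} q(i,j) \log \frac{u(i,j)}{\left[ \sum_k u(k,i) \right] \pi(i,j)} = \sum_{j=1}^{|\Sigma|} \sum_{i=1}^{|\Sigma|} q(i,j) \log \frac{u(1,j)}{|\Sigma|\, u(1,i)\, \pi(i,j)} = -\sum_{j=1}^{|\Sigma|} \sum_{i=1}^{|\Sigma|} q(i,j) \log \{ |\Sigma|\, \pi(i,j) \} + \alpha \left[ \sum_{i=1}^{|\Sigma|} q(i,j_0) - q(j_0) \right] \qquad (2.4.140)$$

which implies, by considering $\alpha \to \infty$, that $I^{(2)}(q) = \infty$.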


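Finally, when $q$ is shift invariant then for any $u \gg 0$

$$\sum_{i=1}^{|\Sigma|} \sum_{j=1}^{|\Sigma|} q(i,j) \log \frac{\left[ \sum_k u(k,i) \right] q(j)}{\left[ \sum_k u(k,j) \right] q(i)} = 0. \qquad (2.4.141)$$

Let $u(i|j) = u(i,j) / \sum_k u(k,j)$ and $q(i|j) = q(i,j)/q(j)$, i.e., $q(i|j) = q(i,j) / \sum_k q(k,j)$ (since $q$ is shift invariant). Now, by (2.4.139) and (2.4.141)

$$I^{(2)}(q) - \sum_{i=1}^{|\Sigma|} q(i)\, H(q(\cdot|i) \,|\, \pi(i,\cdot)) = \sup_{u \gg 0} \sum_{i=1}^{|\Sigma|} \sum_{j=1}^{|\Sigma|} q(i,j) \log \frac{u(i,j)\, q(i)}{\left[ \sum_k u(k,i) \right] q(i,j)} = \sup_{u \gg 0} \sum_{i=1}^{|\Sigma|} \sum_{j=1}^{|\Sigma|} q(i,j) \log \frac{u(i|j)}{q(i|j)} = \sup_{u \gg 0} \left\{ -\sum_{j=1}^{|\Sigma|} q(j)\, H(q(\cdot|j) \,|\, u(\cdot|j)) \right\}. \qquad (2.4.142)$$

Since $H(q(\cdot|j) \,|\, u(\cdot|j)) \ge 0$ it follows that $I^{(2)}(q) \le \sum_i q(i)\, H(q(\cdot|i) \,|\, \pi(i,\cdot))$ with equality whenever $q \gg 0$ (by the choice $u = q$). The proof is complete for $q$ which is not strictly positive by considering a sequence $u_n \gg 0$ such that $u_n \to q$ (so $q(j)\, H(q(\cdot|j) \,|\, u_n(\cdot|j)) \to 0$ for each $j$). □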

Exercises: 


2.4.4 Prove that for any strictly positive stochastic matrix $\Pi$

$$J(\nu) = \inf_{\{q \,:\, \sum_i q(i,\cdot) = \nu\}} I^{(2)}(q) \qquad (2.4.143)$$

where $J(\cdot)$ is the rate function defined in (2.4.129) while $I^{(2)}(\cdot)$ is as specified in (2.4.137).
Hint: There is no need to prove the above identity directly. Instead observe that the empirical measure $L_n^X$ belongs to a set $A$ iff $L_n^{X,(2)} \in \{q : \sum_i q(i,\cdot) \in A\}$ (where the initial condition $X_0 = x$ is equivalent to any initial condition $X_{-1}X_0 = (i,x)$ for the $\Pi^{(2)}$ chain). As the projection of any measure $q \in M_1(\Sigma^{(2)})$ onto its marginal $\nu \in M_1(\Sigma)$ is continuous and $I^{(2)}$ controls the large deviations of $L_n^{X,(2)}$, deduce that the right side of (2.4.143) is a rate function governing the large deviations of $L_n^X$. Conclude by proving the uniqueness of such a function and applying Theorem 2.4.3.


2.4.5 (a). Extend the validity of the identity (2.4.143) to any irreducible non-negative matrix $\Pi$.

Hint: First extend Theorem 2.4.4 to any irreducible stochastic matrix $\Pi$ by replacing $\Sigma^{(2)}$ with $\Sigma_\Pi$ (see the remark following the statement of this theorem). Then, for any irreducible, non-negative matrix $\Pi$ consider $\Pi^*$ defined in exercise 2.4.3 and verify that $I^{(2)}_{\Pi^*}(q) = I^{(2)}_{\Pi}(q) + \sum_i q(i) \log \phi(i)$.

(b). Deduce by applying the identities (2.4.134) and (2.4.143) that for any non-negative irreducible matrix $\Pi$

$$-\log \rho(\Pi) = \inf_{q \in M_1(\Sigma_\Pi)} I^{(2)}(q) = \inf_{\substack{q \in M_1(\Sigma_\Pi) \\ q \text{ shift invariant}}} \sum_{i=1}^{|\Sigma|} \sum_{j=1}^{|\Sigma|} q(i,j) \log \frac{q(j|i)}{\pi(i,j)}. \qquad (2.4.144)$$

This is Varadhan's characterization of the spectral radius of non-negative irreducible matrices which is extremely useful for many applications.

2.4.6 Assume that $X_1, \dots, X_n$ have the joint law $P_x^{\pi}$ where $\Pi$ is an irreducible stochastic matrix. Let $T_1, \dots, T_n$ be a sequence of random variables over $\mathcal{T} = \{1, 2, \dots, \ell\}$ which are conditionally independent given any realization $\{X_k\}_{k=1}^{n}$ while $\mathrm{Prob}(T_k = t \,|\, X_k = i) = p(i,t)$, $i \in \Sigma$, $t \in \mathcal{T}$.

Construct the partial sums $S_m = \sum_{k=1}^{m} T_k$ and let $K_n$ be the stopping time where $S_m$ first hits or exceeds the integer value $n$. The time weighted empirical measures $L_n^{X,T}$ are defined via

$$L_n^{X,T}(j) = \frac{1}{n} \left[ \sum_{k=1}^{K_n - 1} T_k\, \delta_j(X_k) + (n - S_{K_n - 1})\, \delta_j(X_{K_n}) \right].$$

(a). Suppose that $p(i,\ell) > 0$ for any $i \in \Sigma$. Prove that $L_n^{X,T}$ satisfies a large deviations principle with rate function


J(u)  =  inf 

Au 


iS| 


\^\  I 


i=i  i=i  t=i 


where 


|S|  C  |E|  ^ 

=  {g(-li)  €  Mi(SxT),a(i)  >  0  :  =  a(j),  E  E ^10  =  Ki).  Vj  €  S} 

1=1  1=1  t=i  1=1 


Hint: Interpret the event $\{T_k = t\}$ as if the Markov chain freezes in its current state $X_k$ for $t$ time units and observe that $L_n^{X,T}$ is merely the standard empirical measure in this new time scale. Consider the Markov chain whose state space $\Sigma \times \mathcal{T}$ consists of the original states and the future time spans in which state changes are still forbidden (starting at the initial state $(x,1)$). Show
that  the  transition  from  (i,s)  to  (j,t)  in  this  chain  has  probability  ir(z,j)p(f,0  when  s  =  1  and 
Si(j)Ss-i(t)  otherwise.  Apply  Theorem  2.4.4  to  the  pair  empirical  measure  of  this  chain 
Finally,  observe  that  s),  {j,  t))  for  any  j  e  S,  and  apply  a  "contraction 

argument”  of  the  type  hinted  about  in  exercise  2.4.4. 

(b).  Suppose  that  p{i,t)  =  p{t)  and  7r(z,j)  =  p{j).  Prove  that  now  the  rate  function  for  L^’'^  is 


7(1/) 


inf 


H{q\p  X  p) 
E,{T,) 


where  £’,(Ti)  =  and  q  G  Mi(S  x  T). 
2.4.7 (a). Prove that

$$\frac{1}{n} \log P_x^{\pi}(X_1 = x_1, \dots, X_n = x_n) = \sum_{i=1}^{|\Sigma|} \sum_{j=1}^{|\Sigma|} L_n^{x,(2)}(i,j) \log \pi(i,j)$$

for any sequence $x = (x_1, \dots, x_n) \in \Sigma^n$ of non-zero $P_x^{\pi}$ probability.

(b). Let

$$\mathcal{L}_n = \{ q : q = L_n^{x,(2)},\ P_x^{\pi}(X_1 = x_1, \dots, X_n = x_n) > 0 \text{ for some } x \in \Sigma^n \}$$

be the set of possible types of pairs of states of the Markov chain. Prove that $\mathcal{L}_n \subset M_1(\Sigma_\Pi)$ and $|\mathcal{L}_n| \le (n+1)^{|\Sigma|^2}$.

(c). Let $T(q)$ be the type class of $q \in \mathcal{L}_n$, namely the set of sequences $x$ of positive $P_x^{\pi}$ probability for which $L_n^{x,(2)} = q$ and let $H(q) = -\sum_{i,j} q(i,j) \log q(j|i)$. Suppose that for any $q \in \mathcal{L}_n$

$$(n+1)^{-(|\Sigma|^2 + |\Sigma|)}\, e^{nH(q)} \le |T(q)| \le e^{nH(q)} \qquad (2.4.145)$$

and moreover that

$$\lim_{n\to\infty} d_V(q, \mathcal{L}_n) = 0 \qquad \forall q \in M_1(\Sigma_\Pi),\ q \text{ shift invariant}. \qquad (2.4.146)$$

Prove by adapting the method of types of Section 2.1.1 that $L_n^{X,(2)}$ satisfies a large deviations principle with the rate function specified in (2.4.137).

Remark: The estimates (2.4.145) and (2.4.146) are consequences of a somewhat involved combinatorial estimate of $|T(q)|$ (see for example [18], eq. (35)-(37) and references therein).


2.5  Long  rare  segments  in  random  walks 


Let $X_1, \dots, X_n, \dots$ be i.i.d. random vectors in $\mathbb{R}^d$ with $\Lambda(\cdot)$ a steep function such that $\Lambda(\lambda) < \infty$ for some open ball around the origin. Let $A$ be any rare $\Lambda^*$ continuity subset of $\mathbb{R}^d$, namely, such that

$$I_A = \inf_{x \in A^o} \Lambda^*(x) = \inf_{x \in \overline{A}} \Lambda^*(x) > 0, \qquad (2.5.147)$$

where $\Lambda^*(x)$ is the Legendre transform of $\Lambda(\lambda)$ defined in (2.3.60).

Consider the random walk $S_k = \sum_{i=1}^{k} X_i$, $k = 1, 2, \dots$, $S_0 = 0$ and let

$$R_n^{(A)} = \max \left\{ m - k \;:\; 0 \le k < m \le n,\ \frac{S_m - S_k}{m - k} \in A \right\}. \qquad (2.5.148)$$

Thus, $R_n^{(A)}$ is the maximal length among all segments of the random walk up to $n$ in which the empirical mean is within the set $A$.

Associated with $R_n^{(A)}$ is the dual variable

$$T_r^{(A)} = \inf \left\{ m \;:\; \frac{S_m - S_k}{m - k} \in A \text{ for some } 0 \le k \le m - r \right\}, \qquad (2.5.149)$$

so that $\{R_n^{(A)} \ge r\}$ if and only if $\{T_r^{(A)} \le n\}$.

The analysis of the random variables $R_n^{(A)}$ and $T_r^{(A)}$ has applications in problems of the statistical analysis of DNA sequence matching and in the analysis of search algorithms in computer science. The following theorem yields estimates on rare events which are usually associated with the probability of errors for matching algorithms. For some applications and refinements of these estimates, c.f. [23] and the exercises at the end of this section.

Theorem 2.5.1 $\lim_{n\to\infty} (R_n^{(A)} / \log n) = \lim_{r\to\infty} (r / \log T_r^{(A)}) = 1/I_A$ almost surely.


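Proof: By the Borel-Cantelli lemma and utilizing the duality of events

$$\{R_n^{(A)} \ge r\} = \{T_r^{(A)} \le n\}$$

the theorem follows from the estimates

$$\sum_{r=1}^{\infty} \mathrm{Prob}\left( T_r^{(A)} \le e^{r(I_A - \epsilon)} \right) < \infty, \qquad \forall \epsilon > 0 \qquad (2.5.150)$$

$$\sum_{r=1}^{\infty} \mathrm{Prob}\left( T_r^{(A)} > e^{r(I_A + \epsilon)} \right) < \infty, \qquad \forall \epsilon > 0 \qquad (2.5.151)$$

when $I_A < \infty$ and from

$$\sum_{r=1}^{\infty} \mathrm{Prob}\left( T_r^{(A)} \le e^{r/\epsilon} \right) < \infty, \qquad \forall \epsilon > 0 \qquad (2.5.152)$$

when $I_A = \infty$.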

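The desired estimates (2.5.150), (2.5.151), and (2.5.152) are immediate consequences of the bounds

$$\mathrm{Prob}(T_r^{(A)} > n) \le e^{-\lfloor n/r \rfloor \mu_r(A)} \qquad (2.5.153)$$

and

$$\mathrm{Prob}(T_r^{(A)} \le n) \le n \sum_{\ell=r}^{\infty} \mu_\ell(A), \qquad (2.5.154)$$

coupled with Cramer's theorem (Theorem 2.3.2)

$$\lim_{r\to\infty} \frac{1}{r} \log \mu_r(A) = -I_A, \qquad (2.5.155)$$

where $\mu_\ell$ denotes the law of $\hat{S}_\ell = \frac{1}{\ell} S_\ell$ for $\ell \in \mathbb{Z}^+$ and $\lfloor x \rfloor$ denotes the largest integer which is not larger than $x$. Indeed, assuming (2.5.154), one has by substituting $n = \lfloor e^{r(I_A - \epsilon)} \rfloor$ that when $I_A < \infty$

$$\sum_{r=1}^{\infty} \mathrm{Prob}\left( T_r^{(A)} \le e^{r(I_A - \epsilon)} \right) \le \sum_{r=1}^{\infty} e^{r(I_A - \epsilon)} \sum_{\ell=r}^{\infty} c\, e^{-\ell(I_A - \epsilon/2)} \le c' \sum_{r=1}^{\infty} e^{-r\epsilon/2} < \infty$$

for some positive constants $c, c'$. When $I_A = \infty$ one obtains (2.5.152) by choosing $n = \lfloor e^{r/\epsilon} \rfloor$ in (2.5.154) and following the same line of proof. Similarly, starting with (2.5.153) and choosing $n = \lfloor e^{r(I_A + \epsilon)} \rfloor$ one obtains

$$\sum_{r=1}^{\infty} \mathrm{Prob}\left( T_r^{(A)} > e^{r(I_A + \epsilon)} \right) \le \sum_{r=1}^{\infty} \exp\left( -\frac{c}{r}\, e^{r(I_A + \epsilon)} \mu_r(A) \right) \le \sum_{r=1}^{\infty} \exp\left( -c' e^{c'' r} \right) < \infty,$$

for some positive constants $c, c', c''$. We turn therefore to the proof of the bounds (2.5.153) and (2.5.154). The bound (2.5.153) follows by the inclusion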

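$$\{T_r^{(A)} \le n\} \supset \bigcup_{\ell=1}^{\lfloor n/r \rfloor} B_\ell, \qquad B_\ell = \left\{ \frac{1}{r} \left( S_{\ell r} - S_{(\ell-1) r} \right) \in A \right\}, \qquad (2.5.156)$$

as it implies

$$\mathrm{Prob}(T_r^{(A)} > n) \le 1 - \mathrm{Prob}\left( \bigcup_{\ell=1}^{\lfloor n/r \rfloor} B_\ell \right) = (1 - \mathrm{Prob}(B_1))^{\lfloor n/r \rfloor} \le e^{-\lfloor n/r \rfloor \mathrm{Prob}(B_1)} \qquad (2.5.157)$$

due to the fact that $\{B_\ell\}_{\ell=1}^{\infty}$ are independent events related to disjoint segments of the random walk, of equal probabilities $\mathrm{Prob}(B_\ell) = \mu_r(A)$, $\ell = 1, 2, \dots$.

The bound (2.5.154) follows by the inclusion

$$\{T_r^{(A)} \le n\} \subset \bigcup_{k=0}^{n-r} \bigcup_{m=k+r}^{n} C_{k,m} \subset \bigcup_{k=0}^{n-1} \bigcup_{m=k+r}^{\infty} C_{k,m}, \qquad C_{k,m} = \left\{ \frac{S_m - S_k}{m - k} \in A \right\}, \qquad (2.5.158)$$

and the union of events bound. Note that $\mathrm{Prob}(C_{k,m}) = \mu_{m-k}(A)$, and $m - k \ge r$ while in (2.5.158), there are at most $n$ possible choices of $k$. □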

Remark:  Note  that  Theorem  2.5.1  holds  as  long  as  (2.5.155)  holds.  For  example,  consider  exercises 
2.5.3  and  2.5.4. 
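Theorem 2.5.1 can be made concrete in the simplest setting, where $A = \{1\}$ and the $X_i$ are i.i.d. Bernoulli($p$) (the setting of exercise 2.5.2 below), so that $R_n^{(A)}$ is the longest run of ones and $I_A = -\log p$. The sketch below is a simulation under these assumptions; the function name `longest_run_of_ones`, the seed and the sample size are arbitrary choices for illustration, and a single sample only suggests (it does not prove) that $R_n^{(A)} I_A / \log n$ is close to 1.

```python
import math
import random

def longest_run_of_ones(bits):
    """Length of the longest block of consecutive 1-s in the sequence."""
    best = cur = 0
    for b in bits:
        cur = cur + 1 if b == 1 else 0
        best = max(best, cur)
    return best

def simulated_ratio(n, p, seed):
    """R_n * I_A / log n for one simulated Bernoulli(p) sequence, I_A = -log p."""
    rng = random.Random(seed)
    bits = [1 if rng.random() < p else 0 for _ in range(n)]
    return longest_run_of_ones(bits) * (-math.log(p)) / math.log(n)

if __name__ == "__main__":
    for n in (10**3, 10**4, 10**5):
        print(n, simulated_ratio(n, 0.5, seed=1))
```

The convergence in Theorem 2.5.1 is logarithmic in $n$, so the printed ratios approach 1 only slowly.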


Exercises: 
2.5.1 Suppose that $I_A < \infty$ and (2.5.155) may be refined to

$$\lim_{r\to\infty} \left[ \mu_r(A)\, r^{d/2}\, e^{r I_A} \right] = a,$$

for some $a \in (0, \infty)$ (such an example is presented in Section 2.10 for $d = 1$). Let

$$\hat{R}_n^{(A)} = \frac{I_A R_n^{(A)} - \log n}{\log \log n} + \frac{d}{2}.$$

(a). Repeat the above calculations and prove that $\limsup_{n\to\infty} |\hat{R}_n^{(A)}| \le 1$ almost surely.

(b). Also deduce that $\lim_{n\to\infty} \mathrm{Prob}(\hat{R}_n^{(A)} \le -\epsilon) = 0$ for all $\epsilon > 0$.

2.5.2 Let $A = \{1\}$ and $X_i$ be i.i.d. Bernoulli($p$) random variables. Then, $R_n^{(A)}$ is the longest consecutive run of 1-s in the binary sequence $X_1, \dots, X_n$. Let the renewal times $Z_1, Z_2, \dots$ be the locations of zeros in this sequence (with $Z_0 = 0$). Then, $Q_k = Z_k - Z_{k-1} - 1$, $k = 1, 2, \dots$ are i.i.d. Geometric($1-p$) random variables and $R_{Z_k}^{(A)} = \max\{Q_1, \dots, Q_k\}$. By standard renewal theory $Z_k / k \to \frac{1}{1-p}$ as $k \to \infty$, almost surely. Verify that here $I_A = -\log p$ and deduce that

lim  -  logn)  =  0 

almost  surely,  whenever  limn— oo  fn  =  0. 

2.5.3 (a). Consider a sequence $X_1, \dots, X_n$ of i.i.d. random variables over a finite alphabet $\Sigma$ having a marginal law $\mu$ such that $\Sigma_\mu = \Sigma$. Let $R_n^{(\Gamma)}$ be the longest among all segments of this sequence with segmental empirical measures in the open set $\Gamma \subset M_1(\Sigma)$. Assume that $\mu \notin \Gamma$ and derive the analog of Theorem 2.5.1 for this situation.

(b). Assume further that $\Gamma$ is convex. Let $\nu^*$ be the unique minimizer of $H(\cdot|\mu)$ in $\overline{\Gamma}$. Prove that as $n \to \infty$ the empirical measures associated with the segments contributing to $R_n^{(\Gamma)}$ converge almost surely to $\nu^*$.

Hint: Let $\Gamma^\delta = \Gamma \cap B^c_{\nu^*,\delta}$ where $B_{\nu^*,\delta}$ is an open ball of radius $\delta > 0$ around $\nu^*$. Prove that $\limsup_{n\to\infty} \left( R_n^{(\Gamma^\delta)} / R_n^{(\Gamma)} \right) < 1$ almost surely, for any $\delta > 0$. Deduce that for any $\delta > 0$ the empirical measures of the segments contributing to $R_n^{(\Gamma)}$ are eventually within distance $\delta$ of $\nu^*$ almost surely.

2.5.4 Assume that $X_1, \dots, X_n$ and $Y_1, \dots, Y_n$ are as in exercise 2.4.1. Specifically, $X_k$ are the states of a Markov chain over the finite set $\{1, 2, \dots, |\Sigma|\}$ with an irreducible transition matrix $\Pi$, and the conditional law of $Y_k$ when $X_k = j$ is $\mu_j \in M_1(\mathbb{R}^d)$ while $\{Y_k\}$ are independent given any realization of the Markov chain states. Further, suppose that the logarithmic moment generating functions $\Lambda_j$ associated with $\mu_j$ are finite everywhere and define the matrices $\Pi_\lambda$ via

$$\pi_\lambda(i,j) = \pi(i,j)\, e^{\Lambda_j(\lambda)}.$$

Let $\Lambda^*(x)$ denote the Legendre transform of $\log \rho(\Pi_\lambda)$ and suppose $A$ is a rare $\Lambda^*$ continuity set. Let $R_n^{(A)}$ be the longest among all segments whose $Y$-segmental empirical mean belongs to $A \subset \mathbb{R}^d$. Prove that Theorem 2.5.1 holds with $\Lambda^*$ as defined here.

2.6  The  Gibbs  conditioning  principle  in  finite  alphabet 

Let $X_1, X_2, \dots, X_n$ be a sequence of i.i.d. random variables with law $\mu$ over the finite alphabet $\Sigma \subset \mathbb{R}$ and assume without loss of generality that $\Sigma_\mu = \Sigma$. The following question is of fundamental importance in statistical mechanics. Given a set $A \subset \mathbb{R}$ and a constraint of the type $\hat{S}_n \in A$, what is the conditional law of $X_1$ for large $n$? In other words, what are the limit points of the conditional probability vector

$$\mu^*_n(a_i) = \mathrm{Prob}_\mu(X_1 = a_i \,|\, \hat{S}_n \in A), \qquad i = 1, \dots, |\Sigma|, \qquad (2.6.159)$$

as $n \to \infty$ (recall that $\hat{S}_n = \frac{1}{n} \sum_{j=1}^{n} X_j = \langle L_n^X, a \rangle$). Note that for any function $f : \Sigma \to \mathbb{R}$

$$\langle \mu^*_n, f \rangle = E[f(X_1) \,|\, \hat{S}_n \in A] = E[f(X_2) \,|\, \hat{S}_n \in A] = E\left[ \frac{1}{n} \sum_{j=1}^{n} f(X_j) \,\Big|\, \hat{S}_n \in A \right] = E[\langle L_n^X, f \rangle \,|\, \langle L_n^X, a \rangle \in A] \qquad (2.6.160)$$

where we have used the fact that $X_j$ are identically distributed (although not independent) even under the conditioning $\hat{S}_n \in A$. Therefore,

$$\mu^*_n = E[L_n^X \,|\, L_n^X \in \Gamma], \qquad (2.6.161)$$

where $\Gamma = \{\nu : \langle \nu, a \rangle \in A\}$ (compare with (2.1.19) in Section 2.1.2). Therefore, the characterization of possible limit points of the sequence $\mu^*_n$ as $n \to \infty$ can be cast in terms of conditional limits for the empirical measures $L_n^X$.

The following characterization of the limits of $\mu^*_n$ is a consequence of Theorem 2.1.1 for any non-empty set $\Gamma$ which is an $H(\cdot|\mu)$ continuity set, namely,

$$I_\Gamma = \inf_{\nu \in \Gamma^o} H(\nu|\mu) = \inf_{\nu \in \overline{\Gamma}} H(\nu|\mu). \qquad (2.6.162)$$

Theorem 2.6.1 (Gibbs's principle)

(a). The set of possible limit points of $\mu^*_n$ is the closure of the convex hull of

$$\mathcal{M}_\Gamma = \{ \nu \in \overline{\Gamma} : H(\nu|\mu) = I_\Gamma \}. \qquad (2.6.163)$$

(b). For any convex set $\Gamma$ of non-empty interior the set $\mathcal{M}_\Gamma$ is a point to which $\mu^*_n$ converges as $n \to \infty$.

Remark: For conditions on $\Gamma$ (alternatively, on $A$) under which (2.6.162) holds see exercises 2.1.1-2.1.3 in Section 2.1.1.

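Proof: As $|\Sigma| < \infty$, $\overline{\Gamma}$ is a compact set and thus $\mathcal{M}_\Gamma$ is non-empty. Moreover, part (b) of the theorem follows from part (a) by exercise 2.1.3 and the compactness of $M_1(\Sigma)$ (in that exercise you showed that indeed (2.6.162) holds when $\Gamma$ is a convex set of non-empty interior and that the set $\mathcal{M}_\Gamma$ is a point).

We shall prove that for any $\delta > 0$

$$\lim_{n\to\infty} \mathrm{Prob}_\mu(L_n^X \in \mathcal{M}_\Gamma^\delta \,|\, L_n^X \in \Gamma) = 1, \qquad (2.6.164)$$

with an exponential (in $n$) rate of convergence, where $\mathcal{M}_\Gamma^\delta = \{\nu : d_V(\nu, \mathcal{M}_\Gamma) < \delta\}$.

Since $M_1(\Sigma)$ is a bounded set, (2.6.161) and (2.6.164) imply that for any $\delta > 0$, $\mu^*_n$ eventually belongs to the convex hull of $\mathcal{M}_\Gamma^\delta$. All points in the convex hull of $\mathcal{M}_\Gamma^\delta$ are within variational distance $\delta$ of some point in the convex hull of $\mathcal{M}_\Gamma$ (since $d_V$ is a convex function on $M_1(\Sigma) \times M_1(\Sigma)$). Thus, since $\delta$ is arbitrarily small, limit points of $\mu^*_n$ are necessarily in the closure of the convex hull of $\mathcal{M}_\Gamma$ as claimed.

The limit (2.6.164) follows from

$$\limsup_{n\to\infty} \frac{1}{n} \log \mathrm{Prob}_\mu(L_n^X \in (\mathcal{M}_\Gamma^\delta)^c \,|\, L_n^X \in \Gamma) = \limsup_{n\to\infty} \left\{ \frac{1}{n} \log \mathrm{Prob}_\mu(L_n^X \in (\mathcal{M}_\Gamma^\delta)^c \cap \Gamma) - \frac{1}{n} \log \mathrm{Prob}_\mu(L_n^X \in \Gamma) \right\} < 0. \qquad (2.6.165)$$

However, by Theorem 2.1.1 and (2.6.162)

$$I_\Gamma = -\lim_{n\to\infty} \frac{1}{n} \log \mathrm{Prob}_\mu(L_n^X \in \Gamma), \qquad (2.6.166)$$

whereas by Theorem 2.1.1 also

$$\limsup_{n\to\infty} \frac{1}{n} \log \mathrm{Prob}_\mu(L_n^X \in (\mathcal{M}_\Gamma^\delta)^c \cap \Gamma) \le -\inf_{\nu \in (\mathcal{M}_\Gamma^\delta)^c \cap \overline{\Gamma}} H(\nu|\mu). \qquad (2.6.167)$$

Observe that $\mathcal{M}_\Gamma^\delta$ are open sets and therefore $(\mathcal{M}_\Gamma^\delta)^c \cap \overline{\Gamma}$ are compact sets. Thus, for some $\tilde{\nu} \in (\mathcal{M}_\Gamma^\delta)^c \cap \overline{\Gamma}$

$$\inf_{\nu \in (\mathcal{M}_\Gamma^\delta)^c \cap \overline{\Gamma}} H(\nu|\mu) = H(\tilde{\nu}|\mu) > I_\Gamma, \qquad (2.6.168)$$

where the above inequality follows from the definition of $\mathcal{M}_\Gamma$ since $\tilde{\nu} \notin \mathcal{M}_\Gamma$ while $\tilde{\nu} \in \overline{\Gamma}$. Finally, (2.6.164) follows from (2.6.165)-(2.6.168). □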

Remarks

(a). Intuitively one expects $X_1, \dots, X_k$ to be asymptotically independent (as $n \to \infty$) for any fixed $k$ when the conditioning event is $\{L_n^X \in \Gamma\}$. This is indeed shown in exercise 2.6.3 by considering "super-symbols" from the enlarged alphabet $\Sigma^k$.

(b). Theorem 2.6.1 holds for any set $\Gamma$ satisfying (2.6.162). However, the particular conditioning set $\Gamma = \{\nu : \langle \nu, a \rangle \in A\}$ has an important significance in statistical mechanics because it represents an energy-like constraint.

(c). Recall the relationship (2.1.23) of Section 2.1.2 which implies that for any non-empty, convex, open set $A \subset K$ the unique limit of $\mu^*_n$ is of the form

$$\nu_\lambda(a_i) = \mu(a_i)\, e^{\lambda a_i - \Lambda(\lambda)}$$

for some appropriately chosen $\lambda \in \mathbb{R}$ which is called the Gibbs parameter associated with $A$. In particular, for any $x \in K^o$ the Gibbs parameter $\lambda$ associated with $A = (x - \delta, x + \delta)$ converges as $\delta \to 0$ to the unique solution of the equation $\Lambda'(\lambda) = x$ (for details see Section 2.1.2).

(d). A Gibbs conditioning principle holds beyond the i.i.d. case. Actually, all that is needed is that $X_i$ are exchangeable conditionally upon any given value of $L_n^X$ (so that (2.6.161) holds). For such an example, consider exercise 2.6.2.
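Remark (c) can be illustrated with a short computation. The sketch below assumes the exponentially tilted form $\nu_\lambda(a) \propto \mu(a) e^{\lambda a}$ for the limiting law and solves $\Lambda'(\lambda) = x$ by bisection, using the fact that $\Lambda'(\lambda)$ is the mean of $\nu_\lambda$ and is increasing in $\lambda$. The fair-die example, the bracketing interval and the function names `tilted_law`, `mean` and `gibbs_parameter` are all assumptions made for this illustration.

```python
import math

def tilted_law(alphabet, mu, lam):
    """nu_lambda(a) = mu(a) * exp(lam * a) / normalizing constant."""
    w = [m * math.exp(lam * a) for a, m in zip(alphabet, mu)]
    z = sum(w)
    return [x / z for x in w]

def mean(alphabet, nu):
    """<nu, a> = sum_i a_i * nu(a_i)."""
    return sum(a * p for a, p in zip(alphabet, nu))

def gibbs_parameter(alphabet, mu, x, lo=-50.0, hi=50.0):
    """Solve Lambda'(lam) = x by bisection.  Requires x strictly between the
    smallest and largest letters; [lo, hi] is an assumed bracketing interval."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean(alphabet, tilted_law(alphabet, mu, mid)) < x:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    die = [1, 2, 3, 4, 5, 6]
    uniform = [1.0 / 6.0] * 6
    lam = gibbs_parameter(die, uniform, 4.5)
    # Approximate conditional law of one throw given an empirical mean near 4.5.
    print(lam, tilted_law(die, uniform, lam))
```

Conditioning a fair die on an empirical mean near 4.5 thus shifts mass toward the larger faces in a geometric fashion, while a target mean of 3.5 gives $\lambda = 0$ and leaves the uniform law unchanged.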
Exercises:


2.6.1  Prove  Theorem  2.6.1  by  the  method  of  types  (specifically,  use  Lemma  2.1.4  directly). 

2.6.2  Prove  the  Gibbs  conditioning  principle  for  sampling  without  replacement. 

(a) .  Observe  that  again  Xj  are  identically  distributed  even  when  is  given.  Conclude  that 
(2.6.161)  holds. 

(b) .  Assume  that  T  is  such  that 

i/er°  ygr 

and  define  Mr  =  {i^  6  F  :  =  /r}-  Prove  that  now  both  parts  of  Theorem  2.6.1  hold  (for 

part (b) you may rely on exercise 2.1.12).


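2.6.3 (a). Suppose that $\Sigma = (\Sigma')^k$ and $\mu = (\mu')^k$ are a $k$-th product alphabet and a $k$-th product underlying measure on it and assume that $\Sigma'_{\mu'} = \Sigma'$ (as usual). For any law $\nu \in M_1(\Sigma)$ let $\nu^{(j)} \in M_1(\Sigma')$, $j = 1, \dots, k$ denote its $j$-th marginal on $\Sigma'$. Prove that

$$\frac{1}{k} H(\nu|\mu) \ge \frac{1}{k} \sum_{j=1}^{k} H(\nu^{(j)}|\mu') \ge H\left( \frac{1}{k} \sum_{j=1}^{k} \nu^{(j)} \,\Big|\, \mu' \right),$$

with equality if and only if $\nu = (\nu')^k$ for some $\nu' \in M_1(\Sigma')$.

(b). Assume that

$$\Gamma = \left\{ \nu : \frac{1}{k} \sum_{j=1}^{k} \nu^{(j)} \in \Gamma' \right\} \qquad (2.6.169)$$

for some $\Gamma' \subset M_1(\Sigma')$ which satisfies (2.6.162) with respect to $\mu'$. Prove that then $\mathcal{M}_\Gamma = (\mathcal{M}_{\Gamma'})^k$ and conclude that any limit point of $\mu^*_n$ is a $k$-th product of some appropriate law on $\Sigma'$.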

(c). Consider now the $k$-th joint conditional law

$$\mu^*_n(a_{i_1}, \dots, a_{i_k}) = \mathrm{Prob}_{\mu'}(X_1 = a_{i_1}, \dots, X_k = a_{i_k} \,|\, L_n^X \in \Gamma'), \qquad a_{i_j} \in \Sigma',\ j = 1, \dots, k,$$

where $X_i$ are i.i.d. with marginal law $\mu' \in M_1(\Sigma')$ over the finite alphabet $\Sigma'$ and $\Gamma' \subset M_1(\Sigma')$ satisfies (2.6.162). Let $\mu = (\mu')^k$ be the law of $Y_i = (X_{k(i-1)+1}, \dots, X_{ki})$ over a new alphabet $\Sigma$. Prove that for any $m \in \mathbb{Z}^+$

$$\mu^*_{mk}(a_i) = \mathrm{Prob}_\mu(Y_1 = a_i \,|\, L_m^Y \in \Gamma), \qquad \forall a_i \in \Sigma,$$

where $\Gamma$ is defined in (2.6.169). Deduce that as $n \to \infty$ along integer multiples of $k$ the random variables $X_i$, $i = 1, \dots, k$ are asymptotically conditionally i.i.d. (i.e., any limit point of $\mu^*_n$ is a $k$-th product of an element of $M_1(\Sigma')$).

(d). Prove that the above conclusion extends to $n$ which need not be an integer multiple of $k$ whenever $\mathcal{M}_{\Gamma'}$ is a single point.

2.7 The hypothesis test problem

Consider the problem of hypothesis testing between two product measures for the i.i.d. random variables $Y_1, \dots, Y_n, \dots$. Specifically, $Y_j$ are either distributed according to the law $\mu_0 \in M_1(\Sigma)$ (hypothesis $H_0$) or according to $\mu_1 \in M_1(\Sigma)$ (hypothesis $H_1$). The alphabet $\Sigma$ may in general be quite arbitrary provided that the probability measures $\mu_0$ and $\mu_1$ are well defined (Markov chains over finite alphabet are considered in exercise 2.7.4).

Definition 2.7.1 A decision test $\mathcal{S}$ is a sequence of maps $\mathcal{S}^n : \Sigma^n \to \{0, 1\}$, for $n = 1, 2, \dots$, with the interpretation that when $Y_1 = y_1, \dots, Y_n = y_n$ is observed then $H_0$ is accepted ($H_1$ rejected) if $\mathcal{S}^n(y_1, \dots, y_n) = 0$ while $H_1$ is accepted ($H_0$ rejected) if $\mathcal{S}^n(y_1, \dots, y_n) = 1$.

The performance of a decision test $\mathcal{S}$ is determined by the error probabilities

$$\alpha_n = \mathrm{Prob}_{\mu_0}(H_0 \text{ rejected by } \mathcal{S}^n), \qquad \beta_n = \mathrm{Prob}_{\mu_1}(H_1 \text{ rejected by } \mathcal{S}^n), \qquad n \in \mathbb{Z}^+. \qquad (2.7.170)$$

One wishes to minimize $\beta_n$. If no constraint is put on $\alpha_n$, then one may have $\beta_n = 0$ with the test $\mathcal{S}^n(y_1, \dots, y_n) \equiv 1$ at the cost of $\alpha_n = 1$. Thus, a sensible criterion for optimality is to seek a test which minimizes $\beta_n$ subject to a constraint on $\alpha_n$. Suppose now that the probability measures $\mu_0, \mu_1$ are known a-priori and that they are equivalent measures, so the likelihood ratios $L_{0\|1}(y) = \frac{d\mu_0}{d\mu_1}(y)$ and $L_{1\|0}(y) = \frac{d\mu_1}{d\mu_0}(y)$ exist (some extensions for $\mu_0, \mu_1$ which are not equivalent are given in exercise 2.7.3). This equivalence assumption is valid for example when $\mu_0, \mu_1$ are discrete measures with $\Sigma_{\mu_0} = \Sigma_{\mu_1}$ or when $\Sigma = \mathbb{R}^d$ and both $\mu_0$ and $\mu_1$ possess strictly positive densities. In order to avoid trivialities it is further assumed that $\mu_0$ and $\mu_1$ are distinguishable, i.e. that they differ on a set whose probability under $\mu_0$ (and $\mu_1$) is positive.
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Let  Xj  -  log  ii||o(i^j)  =  -  log  Lo||i(^0)  tienote  the  observed  log-likelihood  ratios.  These  are 
bona-fide  i.i.d.  random  variables  over  IR^  which  are  non-zero  with  positive  probability.  Moreover, 

$$\bar x_0 = E_{\mu_0}[X_1] = E_{\mu_1}[X_1 e^{-X_1}],$$

exists (with possibly $\bar x_0 = -\infty$) as $x e^{-x} \le 1$. Similarly,

$$\bar x_1 = E_{\mu_1}[X_1] = E_{\mu_0}[X_1 e^{X_1}] > E_{\mu_0}[X_1] = \bar x_0,$$

exists (with possibly $\bar x_1 = \infty$) and the above inequality is strict since $X_1$ is non-zero with positive
probability. Note that $\bar x_0$ and $\bar x_1$ may both be characterized in terms of relative entropy, cf. exercise
2.7.1.

Definition 2.7.2 A Neyman-Pearson test is a test in which for any $n$ the mean observed log-likelihood
ratio $S_n = \frac{1}{n}\sum_{j=1}^n X_j$ is compared against a threshold $\gamma_n$ and $H_1$ is accepted (rejected)
when $S_n > \gamma_n$ ($S_n \le \gamma_n$).
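To make Definition 2.7.2 concrete, the following Python sketch applies a Neyman-Pearson test over a two-letter alphabet. The laws `mu0`, `mu1`, the observed sample and the zero threshold are hypothetical choices made only for illustration; they do not come from the text.

```python
import math

def mean_log_likelihood_ratio(ys, mu0, mu1):
    """S_n = (1/n) * sum_j log(mu1(y_j) / mu0(y_j))."""
    return sum(math.log(mu1[y] / mu0[y]) for y in ys) / len(ys)

def neyman_pearson(ys, mu0, mu1, gamma):
    """Return 1 (accept H1) when S_n > gamma, and 0 (accept H0) otherwise."""
    return 1 if mean_log_likelihood_ratio(ys, mu0, mu1) > gamma else 0

# Hypothetical strictly positive (hence equivalent) laws on the alphabet {"a", "b"}.
mu0 = {"a": 0.7, "b": 0.3}
mu1 = {"a": 0.4, "b": 0.6}

sample = ["b", "b", "a", "b", "b", "a", "b", "b"]  # mostly "b", as expected under mu1
print(neyman_pearson(sample, mu0, mu1, gamma=0.0))
```

Only the empirical mean of the log-likelihood ratios enters the decision, which is why the error probabilities reduce to large deviations of $S_n$.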

It is well known that Neyman-Pearson tests are optimal in the sense that there are neither tests
with the same value of $\alpha_n$ and a smaller value of $\beta_n$ nor tests with the same value of $\beta_n$ and a
smaller value of $\alpha_n$ (see for example [7] for a simple proof of this claim).

The exponential rates of $\alpha_n$ and $\beta_n$ for Neyman-Pearson tests with constant thresholds
$\gamma \in (\bar x_0, \bar x_1)$ are thus of particular interest. These may be cast in terms of the large deviations
of $S_n$. In particular, since the $X_j$ are i.i.d. real valued random variables, the following theorem is an
application of Theorem 2.2.1.

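Theorem 2.7.1 For any Neyman-Pearson test with constant threshold $\gamma \in (\bar x_0, \bar x_1)$

$$\lim_{n\to\infty} \frac{1}{n} \log \alpha_n = -\Lambda_0^*(\gamma) < 0, \tag{2.7.171}$$

and

$$\lim_{n\to\infty} \frac{1}{n} \log \beta_n = \gamma - \Lambda_0^*(\gamma) < 0, \tag{2.7.172}$$

where

$$\Lambda_0^*(x) = \sup_{\lambda \in [0,1]} \{\lambda x - \Lambda_0(\lambda)\} \tag{2.7.173}$$

and $\Lambda_0(\lambda) = \log E_{\mu_0}[e^{\lambda X_1}]$ is the logarithmic moment generating function of $X_1$ under $H_0$.

Proof: Both (2.7.171) and (2.7.172) follow by a slight modification of the proof of Theorem 2.2.1.
First note that

$$\bar x_0 = \lim_{\lambda \downarrow 0} \Lambda_0'(\lambda) < \gamma < \lim_{\lambda \uparrow 1} \Lambda_0'(\lambda) = \bar x_1.$$

Thus, $\gamma = \Lambda_0'(\eta)$ for some $\eta \in (0,1)$ and $\Lambda_0^*(\gamma)$ indeed equals the Legendre transform of $\Lambda_0$ at the
point $\gamma$. Now by (2.2.46)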

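$$\limsup_{n\to\infty} \frac{1}{n} \log \alpha_n = \limsup_{n\to\infty} \frac{1}{n} \log \mathrm{Prob}_{\mu_0}(S_n \in (\gamma,\infty)) \le -\inf_{x \ge \gamma} \Lambda_0^*(x) = -\Lambda_0^*(\gamma) \tag{2.7.174}$$

where the last equality follows since $\Lambda_0^*(\cdot)$ is nondecreasing on $[\gamma,\infty)$ as $\gamma > \bar x_0$.

By the definition of $X_j$ the logarithmic moment generating function associated with $\mu_1$ is merely
$\Lambda_0(\lambda + 1)$ and so the rate function $\Lambda_1^*(x) = \Lambda_0^*(x) - x$ governs the large deviations bounds for the
laws of $S_n$ under $H_1$. Apply (2.2.46) once again to obtain

$$\limsup_{n\to\infty} \frac{1}{n} \log \beta_n = \limsup_{n\to\infty} \frac{1}{n} \log \mathrm{Prob}_{\mu_1}(S_n \in (-\infty,\gamma]) \le -\inf_{x \le \gamma} \Lambda_1^*(x) = -\Lambda_1^*(\gamma) \tag{2.7.175}$$

where the last equality follows since $\Lambda_1^*(\cdot)$ is nonincreasing on $(-\infty,\gamma]$ as $\gamma < \bar x_1$.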

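Since $\gamma = \Lambda_0'(\eta)$ for some $\eta \in (0,1)$ where $\Lambda_0$ is a strictly convex, $C^\infty$ function, both $\Lambda_0^*(\cdot)$ and
$\Lambda_1^*(\cdot)$ are continuous at the point $\gamma$ (consider further exercise 2.2.5 of Section 2.2). Moreover, for
large enough $r$ the lower bound (2.2.51) applies to $y_r = \gamma + \frac{1}{r}$ and $\delta_r = \frac{1}{r}$ implying that

$$\liminf_{n\to\infty} \frac{1}{n} \log \mathrm{Prob}_{\mu_0}(S_n \in B_{y_r,\delta_r}) \ge -\Lambda_0^*(y_r) \tag{2.7.176}$$

where $B_{y,\delta} = (y - \delta, y + \delta)$. Since $B_{y_r,\delta_r} \subset (\gamma,\infty)$, taking the limit $r \to \infty$ and combining the above
lower bound with (2.7.174) one deduces (2.7.171). Similarly, one has for $z_r = \gamma - \frac{1}{r}$ (with $\delta_r = \frac{1}{r}$ and $r$ large)

$$\liminf_{n\to\infty} \frac{1}{n} \log \beta_n \ge \liminf_{n\to\infty} \frac{1}{n} \log \mathrm{Prob}_{\mu_1}(S_n \in B_{z_r,\delta_r}) \ge -\Lambda_1^*(z_r) \tag{2.7.177}$$

implying (2.7.172) in the limit $r \to \infty$. □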

Remarks: (a). Observe that Theorem 2.7.1 holds even when $\bar x_0 = -\infty$ or $\bar x_1 = \infty$ or both. Its
proof is actually a specialization of exercise 2.2.7 from Section 2.2.

(b). A refinement of Theorem 2.7.1 is given in exercise 2.10.3 where the exact limiting behavior of
$\alpha_n$ ($\beta_n$) is derived.
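The two limits of Theorem 2.7.1 can be evaluated numerically on a finite alphabet. The sketch below is only an illustration under stated assumptions: the laws `mu0`, `mu1` and the threshold `gamma` are hypothetical, the supremum in (2.7.173) is approximated by a grid search over $[0,1]$, and the means $\bar x_0$, $\bar x_1$ are computed through the relative entropy identities stated in exercise 2.7.1.

```python
import math

def log_mgf(lam, mu0, mu1):
    """Lambda_0(lam) = log E_{mu0}[exp(lam*X_1)] = log sum_a mu1(a)^lam * mu0(a)^(1-lam)."""
    return math.log(sum(mu1[a] ** lam * mu0[a] ** (1.0 - lam) for a in mu0))

def legendre(x, mu0, mu1, grid=10001):
    """Grid approximation of sup over lam in [0, 1] of lam*x - Lambda_0(lam)."""
    return max(k / (grid - 1) * x - log_mgf(k / (grid - 1), mu0, mu1)
               for k in range(grid))

def relative_entropy(nu, mu):
    """H(nu|mu) = sum_a nu(a) log(nu(a)/mu(a)), for strictly positive mu."""
    return sum(nu[a] * math.log(nu[a] / mu[a]) for a in nu if nu[a] > 0)

# Hypothetical strictly positive laws and a threshold inside (x0, x1).
mu0 = {"a": 0.7, "b": 0.3}
mu1 = {"a": 0.4, "b": 0.6}
x0 = -relative_entropy(mu0, mu1)   # mean of X_1 under H0
x1 = relative_entropy(mu1, mu0)    # mean of X_1 under H1
gamma = 0.05

rate_alpha = -legendre(gamma, mu0, mu1)          # limit of (1/n) log alpha_n
rate_beta = gamma - legendre(gamma, mu0, mu1)    # limit of (1/n) log beta_n
print(x0, x1, rate_alpha, rate_beta)
```

Raising `gamma` towards `x1` makes `rate_alpha` more negative and `rate_beta` less negative, which is the trade-off between the two error exponents.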

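A corollary of Theorem 2.7.1 is Chernoff's asymptotic bound on the best achievable Bayesian
probability of error

$$P_n^{(e)} = \mathrm{Prob}(H_0)\,\alpha_n + \mathrm{Prob}(H_1)\,\beta_n. \tag{2.7.178}$$

Corollary 2.7.1 (Chernoff's Bound) If $0 < \mathrm{Prob}(H_0) < 1$ then

$$\inf_{\mathcal{S}} \liminf_{n\to\infty} \frac{1}{n} \log P_n^{(e)} = -\Lambda_0^*(0) = \inf_{\lambda\in[0,1]} \Lambda_0(\lambda), \tag{2.7.179}$$

where the above infimum over $\mathcal{S}$ is over all tests.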

Remarks:

(a). In particular, Theorem 2.7.1 thus implies that the best Bayesian exponential error rate is
achieved by a Neyman-Pearson test with zero threshold.

(b). The rate $\Lambda_0^*(0)$ is called Chernoff's information of the measures $\mu_0$ and $\mu_1$.

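Proof: It suffices to consider only Neyman-Pearson tests. Let $\alpha_n^*$ and $\beta_n^*$ be the error probabilities
for the zero threshold Neyman-Pearson test. Then by (2.7.171) and (2.7.172)

$$\lim_{n\to\infty} \frac{1}{n} \log \alpha_n^* = \lim_{n\to\infty} \frac{1}{n} \log \beta_n^* = -\Lambda_0^*(0). \tag{2.7.180}$$

For any test either $\alpha_n \ge \alpha_n^*$ (when $\gamma_n \le 0$), or $\beta_n \ge \beta_n^*$ (when $\gamma_n \ge 0$). Thus, for any test

$$\frac{1}{n} \log P_n^{(e)} \ge \frac{1}{n} \log[\min\{\mathrm{Prob}(H_0), \mathrm{Prob}(H_1)\}] + \min\Big\{\frac{1}{n} \log \alpha_n^*, \frac{1}{n} \log \beta_n^*\Big\}.$$

As $0 < \mathrm{Prob}(H_0) < 1$, the limit $n \to \infty$ yields (2.7.179) in view of (2.7.180). □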
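For a finite alphabet the Chernoff information $\Lambda_0^*(0) = -\inf_{\lambda\in[0,1]}\Lambda_0(\lambda)$ is a one-dimensional convex minimization. The following sketch, which assumes strictly positive laws and uses a ternary search chosen here only for simplicity, computes it; the two laws are hypothetical.

```python
import math

def chernoff_information(mu0, mu1, iters=200):
    """Return -min over lam in [0, 1] of log sum_a mu1(a)^lam * mu0(a)^(1-lam)."""
    def log_mgf(lam):
        return math.log(sum(mu1[a] ** lam * mu0[a] ** (1.0 - lam) for a in mu0))

    lo, hi = 0.0, 1.0
    for _ in range(iters):          # ternary search, valid because log_mgf is convex
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if log_mgf(m1) < log_mgf(m2):
            hi = m2
        else:
            lo = m1
    return -log_mgf(0.5 * (lo + hi))

# Hypothetical pair of laws; by symmetry the minimum is attained at lam = 1/2.
mu0 = {"a": 0.7, "b": 0.3}
mu1 = {"a": 0.3, "b": 0.7}
print(chernoff_information(mu0, mu1))
```

Exchanging the roles of the two laws replaces $\lambda$ by $1-\lambda$ in the minimization, so the value returned is symmetric in its two arguments.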

Another corollary of Theorem 2.7.1 is the following lemma which determines the best exponential
rate for $\beta_n$ when the $\alpha_n$ are bounded away from 1.

Lemma 2.7.1 (Stein's Lemma) Let $\beta_n^\epsilon$ be the minimum of $\beta_n$ among all tests with $\alpha_n < \epsilon$.
Then, for any $\epsilon < 1$

$$\lim_{n\to\infty} \frac{1}{n} \log \beta_n^\epsilon = \bar x_0. \tag{2.7.181}$$

Proof: It clearly suffices to consider only Neyman-Pearson tests. Then,

$$\alpha_n = \mathrm{Prob}_{\mu_0}(S_n > \gamma_n), \tag{2.7.182}$$

and

$$\beta_n = \mathrm{Prob}_{\mu_1}(S_n \le \gamma_n) = E_{\mu_1}[1_{S_n \le \gamma_n}] = E_{\mu_0}[1_{S_n \le \gamma_n} e^{n S_n}], \tag{2.7.183}$$

where the last equality follows from the definition of $X_j$ (as the observed log-likelihood ratios). The
identity (2.7.183) yields

$$\frac{1}{n} \log \beta_n = \frac{1}{n} \log E_{\mu_0}[1_{S_n \le \gamma_n} e^{n S_n}] \le \gamma_n. \tag{2.7.184}$$

(a). Suppose first that $\bar x_0 = -\infty$. Then, for any Neyman-Pearson test with a fixed threshold $\gamma$,
eventually $\alpha_n < \epsilon$ by (2.7.171). Thus, $\frac{1}{n} \log \beta_n^\epsilon \le \gamma$ for any $\gamma$ and $n$ large enough by (2.7.184) and
(2.7.181) follows.

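(b). Assume now that $\bar x_0 > -\infty$. Similarly, apply (2.7.171) to deduce that eventually $\alpha_n < \epsilon$ for
Neyman-Pearson tests with a constant threshold $\gamma > \bar x_0$ and so by (2.7.184)

$$\limsup_{n\to\infty} \frac{1}{n} \log \beta_n^\epsilon \le \bar x_0 + \eta, \tag{2.7.185}$$

for any $\eta > 0$ and any $\epsilon > 0$.

Moreover, without loss of generality one may assume that

$$\liminf_{n\to\infty} \gamma_n \ge \bar x_0, \tag{2.7.186}$$

for otherwise, by the weak law of large numbers $\limsup_{n\to\infty} \alpha_n = 1$. When (2.7.186) holds and
$\alpha_n < \epsilon$ then by (2.7.182) and the weak law of large numbers

$$\liminf_{n\to\infty} \mathrm{Prob}_{\mu_0}(S_n \in [\bar x_0 - \eta, \gamma_n]) \ge 1 - \epsilon \quad \text{for any } \eta > 0. \tag{2.7.187}$$

Hence, by (2.7.183)

$$\frac{1}{n} \log \beta_n \ge \frac{1}{n} \log E_{\mu_0}\big[1_{S_n \in [\bar x_0 - \eta, \gamma_n]} e^{n S_n}\big] \ge \bar x_0 - \eta + \frac{1}{n} \log \mathrm{Prob}_{\mu_0}(S_n \in [\bar x_0 - \eta, \gamma_n]). \tag{2.7.188}$$

By combining (2.7.187), (2.7.188) and the optimality of these Neyman-Pearson tests one obtains

$$\liminf_{n\to\infty} \frac{1}{n} \log \beta_n^\epsilon \ge \bar x_0 - \eta \quad \text{for any } \eta > 0. \tag{2.7.189}$$

The desired limit (2.7.181) is now a direct consequence of (2.7.185) and (2.7.189). □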
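Stein's lemma can be observed numerically for two Bernoulli laws, where $\bar x_0 = -H(\mu_0|\mu_1)$ by exercise 2.7.1. The sketch below is a hedged illustration: the parameters `p0 < p1` and the level `eps` are hypothetical, only non-randomized tests that threshold the number of ones are considered, and the function name `stein_exponent` is introduced here for the example.

```python
import math

def log_binom_pmf(k, n, p):
    """log of the Binomial(n, p) probability of k successes."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1.0 - p))

def log_sum_exp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def stein_exponent(n, p0, p1, eps):
    """(1/n) log beta_n for the best count-threshold test with alpha_n < eps.

    Assumes p1 > p0, so the log-likelihood ratio increases with the number of
    ones k and the Neyman-Pearson test accepts H0 exactly when k <= kc.
    """
    tail = 0.0          # Prob_{p0}(K > kc), starting from kc = n
    kc = n
    while kc > 0:
        nxt = tail + math.exp(log_binom_pmf(kc, n, p0))   # Prob_{p0}(K > kc - 1)
        if nxt >= eps:
            break       # lowering kc further would violate alpha_n < eps
        tail = nxt
        kc -= 1
    log_beta = log_sum_exp([log_binom_pmf(k, n, p1) for k in range(kc + 1)])
    return log_beta / n

p0, p1, eps = 0.3, 0.6, 0.1
x0 = -(p0 * math.log(p0 / p1) + (1 - p0) * math.log((1 - p0) / (1 - p1)))
for n in (250, 1000, 4000):
    print(n, stein_exponent(n, p0, p1, eps), x0)
```

The convergence is slow, since the threshold sits about a constant multiple of $n^{-1/2}$ above $\bar x_0$, in line with the role of $\eta$ in (2.7.185) and (2.7.189).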

Exercises: 


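2.7.1 Suppose that $Y_1,\ldots,Y_n$ are i.i.d. random variables over the finite set $\Sigma = \{a_1,\ldots,a_{|\Sigma|}\}$
and $\Sigma_{\mu_0} = \Sigma_{\mu_1} = \Sigma$ (namely, $\mu_0$ and $\mu_1$ are strictly positive over $\Sigma$).

(a). Prove that $\bar x_1 = H(\mu_1|\mu_0) < \infty$ and $\bar x_0 = -H(\mu_0|\mu_1) > -\infty$ (see Section 2.1 for the definition
of relative entropy).

(b). For $\eta \in [0,1]$ define the probability measures

$$\mu_\eta(a_j) = \frac{\mu_1(a_j)^{\eta}\,\mu_0(a_j)^{1-\eta}}{\sum_{k=1}^{|\Sigma|} \mu_1(a_k)^{\eta}\,\mu_0(a_k)^{1-\eta}}, \quad j = 1,\ldots,|\Sigma|.$$

Let $\gamma = H(\mu_\eta|\mu_0) - H(\mu_\eta|\mu_1)$ and prove that $\Lambda_0^*(\gamma) = H(\mu_\eta|\mu_0)$.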


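2.7.2 Consider the scenario of exercise 2.7.1.

(a). Define the conditional probability vectors

$$\mu_n^*(a_j) = \mathrm{Prob}_{\mu_0}(Y_1 = a_j \,|\, H_0 \text{ rejected by } \mathcal{S}^n), \quad j = 1,\ldots,|\Sigma|, \tag{2.7.190}$$

where $\mathcal{S}$ is a Neyman-Pearson test with a fixed threshold $\gamma = H(\mu_\eta|\mu_0) - H(\mu_\eta|\mu_1)$ and $\eta \in (0,1)$.

Use Theorem 2.6.1 to deduce that $\mu_n^* \to \mu_\eta$ as $n \to \infty$.

Hint: You may also find parts of the proof of Theorem 2.1.2 useful for solving this problem.

(b). Consider now the $k$-th joint conditional law

$$\mu_n^{*k}(a_{j_1},\ldots,a_{j_k}) = \mathrm{Prob}_{\mu_0}(Y_1 = a_{j_1},\ldots,Y_k = a_{j_k} \,|\, H_0 \text{ rejected by } \mathcal{S}^n), \quad a_{j_\ell} \in \Sigma,\ \ell = 1,\ldots,k.$$

Apply exercise 2.6.3 in order to deduce that

$$\lim_{n\to\infty} \mu_n^{*k}(a_{j_1},\ldots,a_{j_k}) = \mu_\eta(a_{j_1})\,\mu_\eta(a_{j_2}) \cdots \mu_\eta(a_{j_k}).$$

Try to interpret this result.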

2.7.3 Suppose that $L_{1\|0}(y) = \frac{d\mu_1}{d\mu_0}(y)$ does not exist while $L_{0\|1}(y) = \frac{d\mu_0}{d\mu_1}(y)$ does exist. Prove
that Stein's lemma holds true whenever $\bar x_0 = -E_{\mu_0}[\log L_{0\|1}(Y_1)] > -\infty$.

Hint: Split $\mu_1$ into its singular part with respect to $\mu_0$ and its restriction on the support of the
measure $\mu_0$.

2.7.4 Suppose that $Y_1,\ldots,Y_n$ are the states of a Markov chain over the finite set $\Sigma = \{1,2,\ldots,|\Sigma|\}$
where the initial state of the chain $Y_0$ is known a priori to be some $i \in \Sigma$. The transition matrix
under $H_0$ is $\Pi_0$ while under $H_1$ it is $\Pi_1$, both of which are irreducible matrices with the same
set  of  non-zero  values.  Here  the  Neyman-Pearson  tests  are  based  upon  Xj  =  log  and 

T,-  =  for  i  =  0,1.  Derive  the  analogs  of  Theorem  2.7.1  and  Lemma  2.7.1  by  using  the 

results of Section 2.4.3.

2.8 Generalized maximum likelihood for finite alphabets

This section is devoted to yet another version of the hypothesis test problem presented in Section
2.7. In particular, the concept of decision test is as in Definition 2.7.1 and the associated error
probabilities are as given in (2.7.170) there. While the law $\mu_0$ is again assumed known a priori,
here the law $\mu_1$ of $Y_j$ under the hypothesis $H_1$ is unknown. For that reason, neither the methods
nor the results of Section 2.7 apply. Moreover, one has to modify the error criterion since requiring
uniformly small $\beta_n$ over a possibly large class of measures may be too strong (i.e., it may well
be that no test can satisfy such a condition). It is reasonable therefore to ask for a criterion which
involves asymptotic limits. For finite alphabets $\Sigma = \{a_1,\ldots,a_{|\Sigma|}\}$ such a criterion was suggested
by Hoeffding, as follows.


Definition 2.8.1 A test $\mathcal{S}$ is optimal (for a given $\eta > 0$) if, among all tests which satisfy

$$\limsup_{n\to\infty} \frac{1}{n} \log \alpha_n \le -\eta \tag{2.8.191}$$

the test $\mathcal{S}$ has maximal exponential rate of error, i.e. $-\limsup_{n\to\infty} \{\frac{1}{n} \log \beta_n\}$ is maximal (uniformly
over all possible laws $\mu_1$).

As will become evident in Section ?? a considerable weakening of this criterion is necessary for
more general alphabets.

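The following lemma states that it suffices to consider functions of the empirical measure when
trying to construct an optimal test (i.e., the empirical measure is a sufficient statistic for this
problem).

Lemma 2.8.1 For any test $\mathcal{S}$ with error probabilities $\{\alpha_n, \beta_n\}$ there exists a test $\tilde{\mathcal{S}}$ with maps
of the form $\tilde{\mathcal{S}}^n(\mathbf{x}) = \tilde{\mathcal{S}}(L_n^{\mathbf{x}}, n)$ whose error probabilities $\{\tilde\alpha_n, \tilde\beta_n\}$ satisfy

$$\limsup_{n\to\infty} \frac{1}{n} \log \tilde\alpha_n \le \limsup_{n\to\infty} \frac{1}{n} \log \alpha_n, \qquad \limsup_{n\to\infty} \frac{1}{n} \log \tilde\beta_n \le \limsup_{n\to\infty} \frac{1}{n} \log \beta_n. \tag{2.8.192}$$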


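Proof: For any $n \in \mathbb{Z}^+$ let $\mathcal{S}_0^n = (\mathcal{S}^n)^{-1}(0)$ and $\mathcal{S}_1^n = (\mathcal{S}^n)^{-1}(1)$ denote the subsets of $\Sigma^n$ which
the maps $\mathcal{S}^n$ assign to $H_0$ and $H_1$ respectively. For $i = 0,1$ and any $\nu \in \mathcal{L}_n$ let $\mathcal{S}_i^{\nu,n} = \mathcal{S}_i^n \cap T(\nu)$
(recall that $T(\nu)$ is the type class of $\nu$, see Definition 2.1.2). Define

$$\tilde{\mathcal{S}}(\nu,n) = \begin{cases} 0 & \text{if } |\mathcal{S}_0^{\nu,n}| \ge \frac{1}{2}|T(\nu)| \\ 1 & \text{otherwise} \end{cases} \tag{2.8.193}$$

where $|A|$ denotes the cardinality of the set $A$. The test $\tilde{\mathcal{S}}$ specified in the lemma's statement is
composed of the maps $\tilde{\mathcal{S}}^n(\mathbf{x}) = \tilde{\mathcal{S}}(L_n^{\mathbf{x}}, n)$.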

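Recall that for any i.i.d. variables $\mathbf{X} = (X_1,\ldots,X_n)$ with marginal $\mu \in M_1(\Sigma)$ and for any
possible type $\nu \in \mathcal{L}_n$, the conditional measure $\mathrm{Prob}_\mu(\mathbf{X} \,|\, L_n^{\mathbf{X}} = \nu)$ is a uniform measure over the
type class $T(\nu)$. In particular, if $\tilde{\mathcal{S}}(\nu,n) = 0$ then

$$\frac{1}{2}\mathrm{Prob}_{\mu_1}(L_n^{\mathbf{X}} = \nu) \le \frac{|\mathcal{S}_0^{\nu,n}|}{|T(\nu)|}\,\mathrm{Prob}_{\mu_1}(L_n^{\mathbf{X}} = \nu) = \mathrm{Prob}_{\mu_1}(\mathbf{X} \in \mathcal{S}_0^{\nu,n}). \tag{2.8.194}$$

Therefore

$$\tilde\beta_n = \sum_{\{\nu:\,\tilde{\mathcal{S}}(\nu,n)=0\}\cap\mathcal{L}_n} \mathrm{Prob}_{\mu_1}(L_n^{\mathbf{X}} = \nu) \le 2 \sum_{\{\nu:\,\tilde{\mathcal{S}}(\nu,n)=0\}\cap\mathcal{L}_n} \mathrm{Prob}_{\mu_1}(\mathbf{X} \in \mathcal{S}_0^{\nu,n}) \le 2\,\mathrm{Prob}_{\mu_1}(\mathbf{X} \in \mathcal{S}_0^n) = 2\beta_n \tag{2.8.195}$$

which certainly implies that

$$\limsup_{n\to\infty} \frac{1}{n} \log \tilde\beta_n \le \limsup_{n\to\infty} \frac{1}{n} \log \beta_n. \tag{2.8.196}$$

A similar computation shows that $\tilde\alpha_n \le 2\alpha_n$, thus completing the proof. □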
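The majority rule (2.8.193) can be checked by brute force for a small block length. In the sketch below the laws, the block length and the starting test `original` (which looks only at the first symbol) are hypothetical; the error probabilities are computed exactly by enumerating $\Sigma^n$.

```python
import itertools
from collections import Counter

def type_of(x):
    """The type (empirical counts) of the sequence x, as a hashable key."""
    return tuple(sorted(Counter(x).items()))

def symmetrize(test, alphabet, n):
    """Majority rule: decide 0 on a type class when at least half of it is sent to H0."""
    zeros, sizes = Counter(), Counter()
    for x in itertools.product(alphabet, repeat=n):
        nu = type_of(x)
        sizes[nu] += 1
        zeros[nu] += 1 if test(x) == 0 else 0
    decision = {nu: 0 if 2 * zeros[nu] >= sizes[nu] else 1 for nu in sizes}
    return lambda x: decision[type_of(x)]

def error_probabilities(test, mu0, mu1, n):
    """Exact (alpha_n, beta_n) for i.i.d. samples, by enumeration of Sigma^n."""
    alpha = beta = 0.0
    for x in itertools.product(sorted(mu0), repeat=n):
        p0 = p1 = 1.0
        for s in x:
            p0 *= mu0[s]
            p1 *= mu1[s]
        if test(x) == 1:
            alpha += p0     # H0 rejected under mu0
        else:
            beta += p1      # H1 rejected under mu1
    return alpha, beta

mu0 = {0: 0.7, 1: 0.3}
mu1 = {0: 0.4, 1: 0.6}
n = 6
original = lambda x: x[0]                     # hypothetical test: first symbol only
tilde = symmetrize(original, sorted(mu0), n)
a, b = error_probabilities(original, mu0, mu1, n)
a_t, b_t = error_probabilities(tilde, mu0, mu1, n)
print((a, b), (a_t, b_t))
```

The factor 2 is lost only at the level of probabilities, so it disappears after taking $\frac{1}{n}\log$, which is all that (2.8.192) requires.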


Considering from here on tests which depend only on the empirical type, the following is a
characterization of an optimal rule.

Theorem 2.8.1 (Hoeffding) Let the test $\mathcal{S}^*$ consist of the maps

$$\mathcal{S}^{*n}(\mathbf{x}) = \begin{cases} 0 & \text{if } H(L_n^{\mathbf{x}}|\mu_0) < \eta \\ 1 & \text{otherwise} \end{cases} \tag{2.8.197}$$

Then $\mathcal{S}^*$ is an optimal test.


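Proof: By the upper bound of Theorem 2.1.1

$$\limsup_{n\to\infty} \frac{1}{n} \log \mathrm{Prob}_{\mu_0}(H_0 \text{ rejected by } \mathcal{S}^{*n}) = \limsup_{n\to\infty} \frac{1}{n} \log \mathrm{Prob}_{\mu_0}(L_n^{\mathbf{X}} \in \{\nu : H(\nu|\mu_0) \ge \eta\}) \le -\inf_{\{\nu:\,H(\nu|\mu_0)\ge\eta\}} H(\nu|\mu_0) \le -\eta. \tag{2.8.198}$$

Therefore, $\mathcal{S}^*$ obviously satisfies the constraint on $\alpha_n$. Let $\beta_n^*$ denote the error probabilities
associated with the test $\mathcal{S}^*$. Then, by the same upper bound (see (2.1.12) in Section 2.1)

$$\limsup_{n\to\infty} \frac{1}{n} \log \beta_n^* = \limsup_{n\to\infty} \frac{1}{n} \log \mathrm{Prob}_{\mu_1}(L_n^{\mathbf{X}} \in \{\nu : H(\nu|\mu_0) < \eta\}) \le -\inf_{\{\nu:\,H(\nu|\mu_0)<\eta\}} H(\nu|\mu_1) \triangleq -I_\eta \tag{2.8.199}$$

where we have used the fact that for (2.1.12) to hold true, the set over which the minimization is
performed need not be closed. When $I_\eta = \infty$ then $\limsup_{n\to\infty} \frac{1}{n} \log \beta_n^* = -\infty$ is the best possible
exponential rate. Thus, it suffices to check the optimality of $\mathcal{S}^*$ under laws $\mu_1$ for which $I_\eta < \infty$.
Fix one such law $\mu_1$. Clearly $\Sigma_{\mu_0} \cap \Sigma_{\mu_1}$ is non-empty and moreover

$$\eta^* \triangleq \inf_{\{\nu:\,H(\nu|\mu_1)<\infty\}} H(\nu|\mu_0) < \eta, \tag{2.8.200}$$

as $\max\{H(\nu|\mu_0), H(\nu|\mu_1)\} < \infty$ only when $\Sigma_\nu \subset \Sigma_{\mu_0} \cap \Sigma_{\mu_1}$. Furthermore, for any $\eta' > \eta^*$

$$I_{\eta'} = \inf_{\{\nu:\,H(\nu|\mu_0)<\eta'\}} H(\nu|\mu_1) = \inf_{\{\nu \in M_1(\Sigma_{\mu_0}\cap\Sigma_{\mu_1}):\,H(\nu|\mu_0)<\eta'\}} H(\nu|\mu_1) < \infty. \tag{2.8.201}$$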

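Let now $\mathcal{S}$ be any test determined by the binary function $\mathcal{S}(L_n^{\mathbf{x}}, n)$ on $M_1(\Sigma) \times \mathbb{Z}^+$ whose error
probabilities satisfy the constraint (2.8.191). Then, for any $\delta > 0$ and for all $n \ge n_0(\delta)$ large
enough

$$\mathcal{L}_n \cap \{\nu : H(\nu|\mu_0) < \eta - \delta\} \subset \mathcal{L}_n \cap \{\nu : \mathcal{S}(\nu,n) = 0\}. \tag{2.8.202}$$

For otherwise, there exists some $\delta > 0$ and a sequence of laws $\nu_n \in \mathcal{L}_n$ (for infinitely many values
of $n$), such that $H(\nu_n|\mu_0) < \eta - \delta$ while $\mathcal{S}(\nu_n,n) = 1$. Then, by Lemma 2.1.4, for these values of
$n$,

$$\alpha_n \ge \mathrm{Prob}_{\mu_0}(L_n^{\mathbf{X}} = \nu_n) \ge (n+1)^{-|\Sigma|} e^{-nH(\nu_n|\mu_0)} \ge (n+1)^{-|\Sigma|} e^{-n(\eta-\delta)},$$

implying that

$$\limsup_{n\to\infty} \frac{1}{n} \log \alpha_n \ge -(\eta - \delta)$$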

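and contradicting the constraint (2.8.191). Therefore, by combining (2.8.202) and the lower bound
of (2.1.16) one obtains for any $\delta > 0$

$$\liminf_{n\to\infty} \frac{1}{n} \log \beta_n \ge \liminf_{n\to\infty} \frac{1}{n} \log \mathrm{Prob}_{\mu_1}(L_n^{\mathbf{X}} \in \{\nu : H(\nu|\mu_0) < \eta - \delta\}) \ge -\limsup_{n\to\infty} \Big\{ \inf_{\{\nu\in\mathcal{L}_n:\,H(\nu|\mu_0)<\eta-\delta\}} H(\nu|\mu_1) \Big\}. \tag{2.8.203}$$

Since for any $k < \infty$ the set $\bigcup_{n\ge k} \mathcal{L}_n \cap M_1(\Sigma_{\mu_0} \cap \Sigma_{\mu_1})$ is a dense subset of $M_1(\Sigma_{\mu_0} \cap \Sigma_{\mu_1})$ over
which both $H(\cdot|\mu_0)$ and $H(\cdot|\mu_1)$ are continuous functions, it follows by (2.8.201) that the right
side of (2.8.203) equals $-I_{\eta-\delta}$ as long as $\eta - \delta > \eta^*$. Thus, one may deduce from (2.8.201) and
(2.8.203) that

$$\limsup_{n\to\infty} \frac{1}{n} \log \beta_n \ge \liminf_{n\to\infty} \frac{1}{n} \log \beta_n \ge -\inf_{\delta>0} I_{\eta-\delta} = -\lim_{\delta\downarrow 0} I_{\eta-\delta}. \tag{2.8.204}$$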


Finally, the optimality of the test $\mathcal{S}^*$ for the law $\mu_1$ results by comparing (2.8.199) and (2.8.204)
provided that

$$\lim_{\delta\to 0} I_{\eta-\delta} \le I_\eta. \tag{2.8.205}$$

In order to prove (2.8.205) define the measure

$$\tilde\mu_0(a_i) = \begin{cases} \mu_0(a_i)/\mu_0(\Sigma_{\mu_1}) & \mu_1(a_i) > 0 \\ 0 & \text{otherwise} \end{cases} \tag{2.8.206}$$

Note that $H(\tilde\mu_0|\mu_0) = -\log \mu_0(\Sigma_{\mu_1}) = \eta^*$ (see exercise 2.8.1) and $\Sigma_{\tilde\mu_0} = \Sigma_{\mu_0} \cap \Sigma_{\mu_1}$. As
$\{\nu : H(\nu|\mu_0) \le \eta\}$ is a non-empty compact set there exists a measure $\nu^*$ such that

$$H(\nu^*|\mu_1) \le I_\eta, \qquad H(\nu^*|\mu_0) \le \eta.$$

Consider now the family of measures $\nu_\theta = \theta\tilde\mu_0 + (1-\theta)\nu^*$ for $\theta \in (0,1)$. Note that
$\nu_\theta \in M_1(\Sigma_{\mu_0} \cap \Sigma_{\mu_1})$ and $\nu_\theta$ converges to $\nu^*$ pointwise as $\theta \to 0$ thus implying

$$\lim_{\theta\to 0} H(\nu_\theta|\mu_1) = H(\nu^*|\mu_1) \le I_\eta. \tag{2.8.207}$$

Moreover, as $H(\cdot|\mu_0)$ is a convex function

$$H(\nu_\theta|\mu_0) \le \theta H(\tilde\mu_0|\mu_0) + (1-\theta) H(\nu^*|\mu_0) \le \theta\eta^* + (1-\theta)\eta$$

implying that for any $\delta < \theta(\eta - \eta^*)$,

$$I_{\eta-\delta} \le H(\nu_\theta|\mu_1). \tag{2.8.208}$$

Thus, (2.8.205) follows by combining (2.8.207) and (2.8.208). □

Remarks:

(a). The finiteness of the alphabet is essential here as (2.8.204) is obtained by applying the lower
bounds of Lemma 2.1.4 for individual types instead of the natural large deviations lower
bound for open sets of types. Indeed, for non-finite alphabets a considerable weakening of
the optimality criterion is necessary as there are no non-trivial lower bounds for individual
types (see Section ??).

(b). Both Lemma 2.8.1 and Theorem 2.8.1 may be extended to the hypothesis test problem for a
known joint law $\mu_0^n$ versus a family of unknown joint laws $\mu_1^n$ provided that the random
variables $X_1,\ldots,X_n$ are finitely exchangeable under $\mu_0^n$ and any possible $\mu_1^n$ so that the
empirical measure is still a sufficient statistic. This is outlined in exercises 2.8.5-2.8.6.
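The test $\mathcal{S}^*$ of (2.8.197) needs only the empirical measure of the sample and the known law $\mu_0$. A minimal Python sketch, assuming a finite alphabet represented by dictionary keys (the law `mu0`, the level `eta` and the samples are hypothetical):

```python
import math
from collections import Counter

def relative_entropy(nu, mu):
    """H(nu|mu); infinite when nu charges a point where mu vanishes."""
    total = 0.0
    for a, p in nu.items():
        if p > 0:
            if mu.get(a, 0.0) == 0.0:
                return math.inf
            total += p * math.log(p / mu[a])
    return total

def hoeffding_test(xs, mu0, eta):
    """Return 0 (accept H0) if H(L_n^x | mu0) < eta, and 1 otherwise."""
    n = len(xs)
    empirical = {a: c / n for a, c in Counter(xs).items()}
    return 0 if relative_entropy(empirical, mu0) < eta else 1

mu0 = {"a": 0.5, "b": 0.5}
print(hoeffding_test("abababab", mu0, eta=0.1))   # empirical measure equals mu0
print(hoeffding_test("aaaaaaaa", mu0, eta=0.1))   # empirical measure far from mu0
```

No alternative law $\mu_1$ appears anywhere in the code, which reflects the fact that the law under $H_1$ is unknown in this section.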

Exercises:

2.8.1 Prove that for any $\nu \in M_1(\Sigma_{\mu_1})$, if $\nu \ne \tilde\mu_0$ then $H(\nu|\mu_0) > H(\tilde\mu_0|\mu_0) = \eta^*$.

2.8.2 Provide an alternative derivation of (2.8.202) based on the results of Section 2.7.

Hint: For any finite alphabet and any known law $\mu_1$, deduce from (2.7.171) and exercise 2.7.1
that all probability measures within $\mathcal{L}_n$ which are of the form $\nu_\theta(a_i) = c\,\mu_1(a_i)^{\theta}\mu_0(a_i)^{1-\theta}$ where
$\theta \in [0,1]$ is such that $H(\nu_\theta|\mu_0) < \eta$ should eventually satisfy $\mathcal{S}(\nu_\theta,n) = 0$. Then, apply the union
of probabilities bound and the volume estimate of Lemma 2.1.1.

2.8.3 (a). Let $X_j$ for $j \notin \Delta$ be i.i.d. random variables over the finite alphabet $\Sigma = \Sigma_{\mu_0}$
while $X_j$ for $j \in \Delta$ are unknown deterministic points of $\Sigma$. Prove that the test $\mathcal{S}^*$ of (2.8.197)
satisfies (2.8.191) for any deterministic increasing sequence of positive integers $\Delta = \{\Delta_r\}$ for which
$\lim_{r\to\infty} \Delta_r / r = \infty$.

Hint: Let $L_n^{\mathbf{X}^*}$ correspond to $\Delta = \emptyset$ and prove that $\limsup_{n\to\infty} d_V(L_n^{\mathbf{X}}, L_n^{\mathbf{X}^*}) = 0$ almost surely,
with some deterministic rate of convergence which depends only upon the sequence $\Delta$. Conclude
the proof by the continuity of $H(\cdot|\mu_0)$ over $M_1(\Sigma)$.

(b). Construct a counterexample to the claim above when $\Sigma_{\mu_0} \ne \Sigma$.

2.8.4 Prove that under the assumptions of exercise 2.8.3 part (a), if in addition $\Sigma_{\mu_1} = \Sigma$ for
any possible law $\mu_1$ then the test $\mathcal{S}^*$ of (2.8.197) is an optimal test. How far can you relax the
assumption that $\Sigma_{\mu_1} = \Sigma$ for all $\mu_1$?

2.8.5 Suppose that for any $n \in \mathbb{Z}^+$ the random variables $\mathbf{X}^{(n)} = (X_1,\ldots,X_n)$ over the finite
alphabet $\Sigma$ have a known joint law $\mu_0^n$ under $H_0$ and an unknown joint law $\mu_1^n$ under the alternative
$H_1$. Suppose that for any $n$ the variables $\mathbf{X}^{(n)}$ are exchangeable under both $\mu_0^n$ and any possible
$\mu_1^n$ (namely, the probability of any outcome $\mathbf{X}^{(n)} = \mathbf{x}$ is invariant under permutations of indices
in the vector $\mathbf{x}$). Let $L_n^{\mathbf{X}}$ denote the empirical measure (type) of $\mathbf{X}^{(n)}$ and prove that Lemma
2.8.1 holds true in this case. (Note that the variables in $\mathbf{X}^{(n)}$ may well be dependent).

2.8.6 (a). Consider the scenario described in exercise 2.8.5. Suppose that $I^n(\nu|\mu_0) = \infty$ implies
$\mu_0^n(L_n^{\mathbf{X}} = \nu) = 0$ and

$$\lim_{n\to\infty}\ \sup_{\nu\in\mathcal{L}_n,\ I^n(\nu|\mu_0)<\infty} \Big| \frac{1}{n} \log \mu_0^n(L_n^{\mathbf{X}} = \nu) + I^n(\nu|\mu_0) \Big| = 0. \tag{2.8.209}$$

Prove that the test $\mathcal{S}^*$ of (2.8.197) with $H(L_n^{\mathbf{X}}|\mu_0)$ replaced by $I^n(L_n^{\mathbf{X}}|\mu_0)$ is weakly optimal in
the sense that $-\limsup_{n\to\infty} \{\frac{1}{n} \log \beta_n^*\}$ is maximal (uniformly over all possible laws $\mu_1^n$) among all the
tests for which $\limsup_{n\to\infty} \frac{1}{n} \log \alpha_n < \limsup_{n\to\infty} \frac{1}{n} \log \alpha_n^*$.

(b). Apply part (a) to prove the weak optimality of thresholding the rate function of (2.1.25) when
testing a given deterministic composition sequence $\mu_m$ in a sampling without replacement scheme
against any unknown composition sequence for such a scheme.

2.9 Rate distortion theory for stationary and ergodic sources

Throughout this section we are interested in analyzing the following situation: Let $(\mathcal{X}, \mathcal{P})$
be a stationary and ergodic source, with alphabet $\Sigma$, i.e. $\mathcal{P}$ is a stationary ergodic probability
measure on $\Omega = \Sigma^{\mathbb{Z}^+}$, the space of semi-infinite sequences over $\Sigma$. Let $x_1, x_2, \ldots, x_n, \ldots$ denote an
element of $\Omega$, which we say was emitted by the source $(\mathcal{X}, \mathcal{P})$. Note that, since $\mathcal{P}$ is only ergodic,
the random variables $(X_1, \ldots, X_n, \ldots)$ may well be dependent.

Next, let $\rho(x,y) : \Sigma \times \Sigma \to [0,\rho_{\max}]$ be a one symbol bounded distortion function, i.e.
$\rho(x,x) = 0$, $\rho(x,y) \ne 0$ for $x \ne y$ and $\rho_{\max} < \infty$. The basic problem of source coding is to find a
sequence of deterministic maps (codes) $C_n : \Sigma^n \to \Sigma^n$ of small distortion, where the distortion
of a sequence of codes is $\limsup_{n\to\infty} \rho_{C_n}$ and

$$\rho_{C_n} = \frac{1}{n} E\Big[\sum_{j=1}^n \rho(X_j, C_n(\mathbf{X})_j)\Big] \tag{2.9.210}$$

is the average distortion per symbol when the code $C_n$ is used. The goal of the coding is to allow one
to transfer the information contained in sequences emitted by the source $\mathcal{X}$ with small distortion
per symbol, while transmitting as little information as possible.

Clearly, by taking $C_n$ to be the identity map, one obtains zero distortion but with no coding
gain. To get to a more meaningful situation, one would like to have a reduction in the number of
possible sequences when using $C_n$. Let $C_n$ also denote the range of the map $C_n$ and $|C_n|$ denote
the cardinality of this set. The rate of the code $C_n$ is defined as

$$R_{C_n} = \frac{1}{n} \log |C_n| \tag{2.9.211}$$

and the smaller $R_{C_n}$ is, the larger is the coding gain when using $C_n$. The main coding theorem,
due originally to Shannon, asserts that one cannot hope to get $R_{C_n}$ to be too small: under a bound
on the distortion, $R_{C_n}$ is bounded below in general by some positive quantity, and
there are codes which are arbitrarily close to this bound. The proof of this statement which
is presented here relies on the large deviations principle of Theorem 2.3.1.
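The two quantities (2.9.210) and (2.9.211) can be computed exactly for a toy code. The sketch below is an illustration under hypothetical choices: a binary alphabet, blocks of length 3, a `majority` code that maps each block to the all-zeros or all-ones word, an i.i.d. uniform source law, and Hamming distortion.

```python
import itertools
import math

def rate_and_distortion(code, law, rho, alphabet, n):
    """Return (R_{C_n}, rho_{C_n}) for a map `code` on Sigma^n.

    `law(x)` is the probability that the source emits the block x.
    """
    blocks = list(itertools.product(alphabet, repeat=n))
    code_range = {code(x) for x in blocks}            # the set C_n of code-words
    rate = math.log(len(code_range)) / n
    distortion = sum(law(x) * sum(rho(a, b) for a, b in zip(x, code(x))) / n
                     for x in blocks)
    return rate, distortion

n = 3
majority = lambda x: (1,) * n if 2 * sum(x) > n else (0,) * n
uniform = lambda x: 0.5 ** n
hamming = lambda a, b: 0.0 if a == b else 1.0
print(rate_and_distortion(majority, uniform, hamming, (0, 1), n))
```

The identity map on the same blocks would give rate $\log 2$ and zero distortion, so this code trades a lower rate for a positive distortion.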

The  following  definitions  are  required  for  the  precise  statement  of  the  coding  theorem. 

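The distortion associated with any probability measure $Q$ on $\Sigma \times \Sigma$ is

$$\rho_Q = \int_{\Sigma\times\Sigma} \rho(x,y)\, dQ(x,y). \tag{2.9.212}$$

Let $Q_X$ and $Q_Y$ be the marginals of $Q$. Then the mutual information associated with $Q$ is

$$H(Q|Q_X \times Q_Y) \triangleq \int_{\Sigma\times\Sigma} dQ(x,y) \log \frac{dQ}{dQ_X\, dQ_Y}(x,y) \tag{2.9.213}$$

when the above integral is well defined and finite and $H(Q|Q_X \times Q_Y) = \infty$ otherwise.$^1$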

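The one symbol rate distortion function is defined as

$$R_1(D) = \inf_{\{Q:\ \rho_Q \le D,\ Q_X = \mathcal{P}_1\}} H(Q|Q_X \times Q_Y) \tag{2.9.214}$$

where $\mathcal{P}_1$ is the marginal on $\Sigma$ of the stationary measure $\mathcal{P}$ on $\Sigma^{\mathbb{Z}^+}$.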
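For a finite alphabet both ingredients of (2.9.214), the distortion (2.9.212) and the mutual information (2.9.213), are finite sums. The sketch below evaluates them for one hypothetical joint law: a uniform binary first marginal passed through a symmetric channel with crossover probability `d`, under Hamming distortion. It evaluates a single candidate $Q$ and does not perform the infimum in (2.9.214).

```python
import math

def marginals(Q):
    """Return (Q_X, Q_Y) for a joint law given as {(x, y): probability}."""
    qx, qy = {}, {}
    for (x, y), p in Q.items():
        qx[x] = qx.get(x, 0.0) + p
        qy[y] = qy.get(y, 0.0) + p
    return qx, qy

def distortion(Q, rho):
    """rho_Q = sum over (x, y) of rho(x, y) Q(x, y)."""
    return sum(p * rho(x, y) for (x, y), p in Q.items())

def mutual_information(Q):
    """H(Q | Q_X x Q_Y) = sum Q(x,y) log( Q(x,y) / (Q_X(x) Q_Y(y)) )."""
    qx, qy = marginals(Q)
    return sum(p * math.log(p / (qx[x] * qy[y]))
               for (x, y), p in Q.items() if p > 0)

hamming = lambda x, y: 0.0 if x == y else 1.0
d = 0.1
Q = {(0, 0): 0.5 * (1 - d), (0, 1): 0.5 * d,
     (1, 0): 0.5 * d, (1, 1): 0.5 * (1 - d)}
print(distortion(Q, hamming), mutual_information(Q))
```

For this particular $Q$ the mutual information equals $\log 2 - h(d)$ with $h$ the binary entropy in nats, which gives an upper bound on $R_1(d)$ for a source whose one symbol marginal is uniform on two letters.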

The one symbol distortion function $\rho(x,y)$ implies the corresponding $J$-symbol average distortion
for $J = 2,3,\ldots$

$$\rho^{(J)}((x_1,\ldots,x_J),(y_1,\ldots,y_J)) = \frac{1}{J} \sum_{\ell=1}^{J} \rho(x_\ell,y_\ell). \tag{2.9.215}$$

$^1$In information theory books the mutual information is usually denoted by $I(X;Y)$. The notation $H(Q|Q_X \times Q_Y)$
is more consistent with all other notations of this book.

Thus, the $J$-symbol distortion associated with a probability measure $Q$ on $\Sigma^J \times \Sigma^J$ is

$$\rho_Q^{(J)} = \int_{\Sigma^J\times\Sigma^J} \rho^{(J)}(x,y)\, dQ(x,y), \tag{2.9.216}$$

and the mutual information associated with the measure $Q$ having marginals $Q_X$, $Q_Y$ on $\Sigma^J$ is

$$H(Q|Q_X \times Q_Y) = \int_{\Sigma^J\times\Sigma^J} dQ(x,y) \log \frac{dQ}{dQ_X\, dQ_Y}(x,y). \tag{2.9.217}$$

The $J$-symbol rate distortion function is therefore defined as

$$R_J(D) = \inf_{\{Q:\ \rho_Q^{(J)} \le D,\ Q_X = \mathcal{P}_J\}} \frac{1}{J} H(Q|Q_X \times Q_Y), \tag{2.9.218}$$

where $\mathcal{P}_J$ is the $J$-th marginal (on $\Sigma^J$) of the stationary measure $\mathcal{P}$. Finally, the rate distortion
function is

$$R(D) \triangleq \inf_{J\ge 1} R_J(D). \tag{2.9.219}$$

The source coding theorem states that the rate distortion function is a tight lower bound on the
limiting rate $R_{C_n}$ of a sequence of codes $\{C_n\}_{n=1}^\infty$ with distortion $D$.

Theorem 2.9.1 (Source Coding Theorem)

(a). Direct Part: For any $D \ge 0$ such that $R(D) < \infty$ and any $\delta > 0$, there exists a sequence of
codes $\{C_n\}_{n=1}^\infty$ with distortion at most $D$ and rates $R_{C_n} \le R(D) + \delta$.

(b). Converse Part: For any sequence of codes of distortion $D$ and any $\delta > 0$

$$\liminf_{n\to\infty} R_{C_n} \ge R(D + \delta).$$

Remark: Note that $|\Sigma|$ may be infinite and there are no structural conditions on $\Sigma$ besides the
requirement that $\mathcal{P}$ be based on $\Sigma^{\mathbb{Z}^+}$. On the other hand, whenever $R(D)$ is finite, the resulting
codes always take values in some finite set and in particular, may be represented by finite binary
sequences.

The  proof  of  the  Source  Coding  Theorem  is  presented  via  a  sequence  of  lemmas  with  the  large 
deviations  principle  of  Section  2.3  implying  the  first  lemma  which  is  key  for  the  Direct  Part  of  the 
theorem. 

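Lemma 2.9.1 Suppose $Q$ is any probability measure on $\Sigma \times \Sigma$ for which $I_Q \triangleq H(Q|Q_X \times Q_Y) < \infty$,
$Q_X = \mathcal{P}_1$ and $\rho_Q < \rho_{Q_X\times Q_Y}$ (where $\rho_{Q_X\times Q_Y}$ is the distortion associated with $Q_X \times Q_Y = \mathcal{P}_1 \times Q_Y$).
Let $Z_n(\mathbf{x}) = \frac{1}{n} \sum_{j=1}^n \rho(x_j, Y_j)$ where the $Y_j$ are independent random variables distributed over $\Sigma$
according to $Q_Y$. Then,

$$\liminf_{n\to\infty} \frac{1}{n} \log \mathrm{Prob}(Z_n(\mathbf{X}) < \rho_Q \,|\, \mathbf{X}) \ge -I_Q \quad \text{almost surely } \mathcal{P} \tag{2.9.220}$$

where $\mathbf{X}$ is a random sequence of symbols emitted by the source $\mathcal{X}$.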

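Proof: The inequality (2.9.220) follows from the lower bound of Theorem 2.3.1 for the open set
$\Gamma = (-\infty, \rho_Q)$ and the random variables $\{Z_n\}_{n=1}^\infty$. Specifically, Theorem 2.3.1 is applied per
element of $\Omega = \Sigma^{\mathbb{Z}^+}$, where the conditions of this theorem hold almost surely by the ergodicity of
$\mathcal{P}$. To verify that, let $\Lambda_n(\theta) = \log E[e^{\theta Z_n(\mathbf{X})} | \mathbf{X}]$ and note that since $\rho(\cdot,\cdot)$ is a bounded function
$|\Lambda_n(\theta)| < \infty$ for all $\theta \in \mathbb{R}$ and any realization of $\mathbf{X}$. By Birkhoff's ergodic theorem [3]

$$\Lambda(\theta) = \lim_{n\to\infty} \frac{1}{n} \Lambda_n(n\theta) = \lim_{n\to\infty} \frac{1}{n} \sum_{j=1}^n \log \int_\Sigma e^{\theta\rho(X_j,y)}\, dQ_Y(y) \tag{2.9.221}$$

exists almost surely $\mathcal{P}$. Moreover,

$$\Lambda(\theta) = \int_\Sigma \log\Big(\int_\Sigma e^{\theta\rho(x,y)}\, dQ_Y(y)\Big) d\mathcal{P}_1(x) = \int_\Sigma \log\Big(\int_\Sigma e^{\theta\rho(x,y)}\, dQ_Y(y)\Big) dQ_X(x), \tag{2.9.222}$$

does not depend on the specific sequence $\mathbf{X}$ emitted by the source (recall that $Q_X = \mathcal{P}_1$ implying
the second equality above). Furthermore, since $\rho(\cdot,\cdot)$ is uniformly bounded the function $\Lambda(\cdot)$ is
finite and differentiable everywhere (in $\mathbb{R}$). Therefore, by Lemma 2.3.2, part (c) of Theorem 2.3.1
applies, yielding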

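$$\liminf_{n\to\infty} \frac{1}{n} \log \mathrm{Prob}(Z_n(\mathbf{X}) < \rho_Q \,|\, \mathbf{X}) \ge -\inf_{x<\rho_Q} \Lambda^*(x) \triangleq -J_Q \quad \text{almost surely } \mathcal{P} \tag{2.9.223}$$

Recall that $\Lambda^*(x) = \sup_\lambda (\lambda x - \Lambda(\lambda))$. As $\Lambda(0) = 0$ (see (2.9.222)), $\Lambda'(0) = \rho_{Q_X\times Q_Y}$ (differentiate
(2.9.222) and compare with (2.9.212)) and $\Lambda'(\lambda)$ is monotonically nondecreasing (since $\rho(\cdot,\cdot) \ge 0$)
it follows that for $x < \rho_Q < \Lambda'(0)$

$$\sup_{\lambda\ge 0}[\lambda x - \Lambda(\lambda)] \le \sup_{\lambda\ge 0}[\lambda\Lambda'(0) - \Lambda(\lambda)] \le \Lambda^*(\Lambda'(0)) = 0\cdot\Lambda'(0) - \Lambda(0) = 0. \tag{2.9.224}$$

Since $\Lambda^*(x) \ge 0$, it follows that for $x < \rho_Q$

$$\Lambda^*(x) = \sup_{\lambda\le 0}[\lambda x - \Lambda(\lambda)] \ge \Lambda^*(\rho_Q) \tag{2.9.225}$$

and thus $J_Q = \Lambda^*(\rho_Q)$.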


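For any $\lambda \in \mathbb{R}$ define the probability measure $Q_\lambda$ on $\Sigma \times \Sigma$ via

$$\frac{dQ_\lambda(x,y)}{dQ_X(x)\, dQ_Y(y)} \triangleq \frac{e^{\lambda\rho(x,y)}}{\int_\Sigma e^{\lambda\rho(x,z)}\, dQ_Y(z)}.$$

Since $\rho(\cdot,\cdot)$ is a bounded function the measures $Q_\lambda$ and $Q_X \times Q_Y$ are equivalent and as
$H(Q|Q_X \times Q_Y) < \infty$ the relative entropy $H(Q|Q_\lambda)$ is well defined. Moreover,

$$0 \le H(Q|Q_\lambda) = \int_{\Sigma\times\Sigma} dQ(x,y) \log \frac{dQ}{dQ_\lambda}(x,y) = I_Q - \lambda\rho_Q + \Lambda(\lambda). \tag{2.9.226}$$

Since (2.9.226) holds for all $\lambda$, it follows that $I_Q \ge \Lambda^*(\rho_Q)$. The proof is now completed in view of
(2.9.223) and (2.9.225). □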
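The identity (2.9.226) is easy to confirm numerically on a finite alphabet, where $\Lambda(\lambda)$ of (2.9.222) is a finite sum over the marginal $Q_X$. The following sketch is a sanity check under hypothetical choices of the joint law `Q`, the Hamming distortion and the values of `lam`; the function name `check_identity` is introduced here for the example.

```python
import math

def check_identity(Q, rho, lam):
    """Return (H(Q|Q_lam), I_Q - lam*rho_Q + Lambda(lam)) for a finite joint law Q."""
    qx, qy = {}, {}
    for (x, y), p in Q.items():
        qx[x] = qx.get(x, 0.0) + p
        qy[y] = qy.get(y, 0.0) + p
    # z(x) = sum_y exp(lam*rho(x,y)) Q_Y(y), and Lambda(lam) = sum_x Q_X(x) log z(x).
    z = {x: sum(math.exp(lam * rho(x, y)) * qy[y] for y in qy) for x in qx}
    big_lambda = sum(qx[x] * math.log(z[x]) for x in qx)
    i_q = sum(p * math.log(p / (qx[x] * qy[y]))
              for (x, y), p in Q.items() if p > 0)
    rho_q = sum(p * rho(x, y) for (x, y), p in Q.items())
    # Q_lam(x, y) = Q_X(x) Q_Y(y) exp(lam*rho(x,y)) / z(x)
    h = sum(p * math.log(p / (qx[x] * qy[y] * math.exp(lam * rho(x, y)) / z[x]))
            for (x, y), p in Q.items() if p > 0)
    return h, i_q - lam * rho_q + big_lambda

hamming = lambda x, y: 0.0 if x == y else 1.0
Q = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}
for lam in (-2.0, -0.5, 0.0, 1.0):
    print(lam, check_identity(Q, hamming, lam))
```

Since the left side is a relative entropy it is non-negative for every `lam`, which is exactly how the bound $I_Q \ge \Lambda^*(\rho_Q)$ is obtained.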

The proof of the Direct Part of Theorem 2.9.1 is based on a random coding argument, where
instead of specifically constructing the codes $C_n$, the classes $\mathcal{C}_n$ of all codes of some fixed size are
considered. Let $\bar\rho_n = E_{\mathcal{C}_n}[\rho_{C_n}]$ be the average of $\rho_{C_n}$ over $\mathcal{C}_n$, where the distribution within the
class $\mathcal{C}_n$ results by choosing the code $C_n$ at random according to some probability measure. For
any probability measure over $\mathcal{C}_n$ there exists at least one code in $\mathcal{C}_n$ for which $\rho_{C_n} \le \bar\rho_n$. In the
following lemma an upper bound on $\bar\rho_n$ is derived based on the large deviations lower bound of
Lemma 2.9.1.


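Lemma 2.9.2 Suppose $Q$ is a probability measure on $\Sigma \times \Sigma$ for which $H(Q|Q_X \times Q_Y) < \infty$ and
$Q_X = \mathcal{P}_1$. Fix $\delta > 0$ arbitrarily small and let $\mathcal{C}_n$ be the class of all codes $C_n$ of size $|C_n| =
\lfloor e^{n(H(Q|Q_X\times Q_Y)+\delta)} \rfloor$. Then, there exist distributions on $\mathcal{C}_n$ for which $\limsup_{n\to\infty} \bar\rho_n \le \rho_Q$.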


Proof: Fix the probability measure $Q$ and let $I_Q = H(Q|Q_X \times Q_Y)$.

(a). Suppose that $\rho_{Q_X\times Q_Y} \le \rho_Q$ (recall that $\rho_{Q_X\times Q_Y}$ is the distortion associated with the measure
$Q_X \times Q_Y = \mathcal{P}_1 \times Q_Y$). Let $Y_1,\ldots,Y_n$ be i.i.d. according to the law $Q_Y$ and $\mathbf{Y} = (Y_1,\ldots,Y_n)$ be
a code-word of $C_n$ to which all of $\Sigma^n$ is mapped. Since $I_Q \ge 0$ this construction is always possible
($|C_n| \ge 1$) and it results in a class of codes $\mathcal{C}_n$, one code per realization of $\mathbf{Y}$. The average
distortion over codes in $\mathcal{C}_n$ is now

$$\bar\rho_n = E_{\mathbf{Y}}\Big[\frac{1}{n} \sum_{j=1}^n E[\rho(X_j,Y_j) \,|\, \mathbf{Y}]\Big] = \int_{\Sigma\times\Sigma} \rho(x,y)\, d\mathcal{P}_1(x)\, dQ_Y(y) = \rho_{Q_X\times Q_Y}. \tag{2.9.227}$$

Thus, since $\bar\rho_n = \rho_{Q_X\times Q_Y} \le \rho_Q$ the proof of the lemma is complete.

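(b). Consider now the case when $\rho_{Q_X\times Q_Y} > \rho_Q$, $I_Q < \infty$ and $Q_X = \mathcal{P}_1$. Let
$k_n = |C_n| = \lfloor e^{n(I_Q+\delta)} \rfloor$ be as specified in the statement of the lemma and $Y_j^{(i)}$, $j = 1,\ldots,n$,
$i = 1,\ldots,k_n$, be $n \times k_n$ i.i.d. random variables of law $Q_Y$. The probability distribution on the
class of codes $\mathcal{C}_n$ is generated by considering the codes with code-words $\mathbf{Y}^{(i)} = (Y_1^{(i)},\ldots,Y_n^{(i)})$ for
$i = 1,\ldots,k_n$. Per realization of these code-words the mapping $C_n$ is constructed as follows. For
any $\mathbf{x} \in \Sigma^n$ define the set

$$S_n(\mathbf{x}) = \Big\{\mathbf{y} : \frac{1}{n} \sum_{j=1}^n \rho(x_j,y_j) < \rho_Q\Big\} \tag{2.9.228}$$

and let $C_n(\mathbf{x})$ be any element of $C_n \cap S_n(\mathbf{x})$, where if this set is empty then $C_n(\mathbf{x})$ is arbitrarily
chosen. For this mapping,

$$\frac{1}{n} \sum_{j=1}^n \rho(x_j, C_n(\mathbf{x})_j) \le \rho_Q + \rho_{\max} 1_{\{C_n\cap S_n(\mathbf{x})=\emptyset\}}$$

implying that

$$\bar\rho_n \le \rho_Q + \rho_{\max}\,\mathrm{Prob}(C_n \cap S_n(\mathbf{X}) = \emptyset), \tag{2.9.229}$$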


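where $\mathbf{X} \in \Sigma^n$ is a random sequence of $n$ symbols emitted by the source and the set $C_n$ consists
of the $k_n$ i.i.d. random vectors $\mathbf{Y}^{(i)}$. Clearly,

$$\mathrm{Prob}(C_n \cap S_n(\mathbf{X}) = \emptyset) = E\big[\mathrm{Prob}(\mathbf{Y}^{(i)} \notin S_n(\mathbf{X}) \text{ for all } i \,|\, \mathbf{X})\big]
= E\big[(1 - \mathrm{Prob}(\mathbf{Y}^{(1)} \in S_n(\mathbf{X})\,|\,\mathbf{X}))^{k_n}\big] \le E\big[e^{-k_n \mathrm{Prob}(\mathbf{Y}^{(1)} \in S_n(\mathbf{X})|\mathbf{X})}\big]. \tag{2.9.230}$$

By (2.9.229) and (2.9.230), $\limsup_{n\to\infty} \bar\rho_n \le \rho_Q$ provided that $k_n \mathrm{Prob}(\mathbf{Y}^{(1)} \in S_n(\mathbf{X})|\mathbf{X}) \to \infty$ as
$n \to \infty$, in probability $\mathcal{P}$. By the definition of $k_n$ it suffices to show that

$$\liminf_{n\to\infty} \frac{1}{n} \log \mathrm{Prob}(\mathbf{Y}^{(1)} \in S_n(\mathbf{X})\,|\,\mathbf{X}) \ge -I_Q \quad \text{in probability } \mathcal{P}, \tag{2.9.231}$$

in order to complete the proof of the lemma. Since Lemma 2.9.1 applies to $Z_n(\mathbf{x}) = \frac{1}{n}\sum_{j=1}^{n} \rho(x_j, Y_j^{(1)})$
and $\{\mathbf{Y}^{(1)} \in S_n(\mathbf{x})\} = \{Z_n(\mathbf{x}) < \rho_Q\}$ the bound (2.9.231) follows. □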
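To get a feel for why $k_n = \lfloor e^{n(I_Q+\delta)} \rfloor$ code-words suffice, the following minimal Python sketch evaluates the two sides of the last inequality in (2.9.230) in an idealized situation where the conditional hit probability equals exactly $e^{-nI_Q}$. This equality is an assumption made only for illustration (the proof needs just the asymptotic lower bound (2.9.231)), and the function name and the numerical values are hypothetical.

```python
import math

def miss_probability_bound(n, rate_I, delta):
    """Return ((1-p)^k, exp(-k*p)) for p = exp(-n*I), k = floor(exp(n*(I+delta))).

    (1-p)^k is the probability that none of k independent code-words
    falls in a set of probability p; exp(-k*p) is the upper bound used
    in (2.9.230). Taking p = exp(-n*I) exactly is an illustrative assumption.
    """
    p = math.exp(-n * rate_I)
    k = math.floor(math.exp(n * (rate_I + delta)))
    # log1p keeps (1-p)^k accurate when p is tiny
    exact = math.exp(k * math.log1p(-p))
    bound = math.exp(-k * p)
    return exact, bound

if __name__ == "__main__":
    for n in (10, 20, 40):
        exact, bound = miss_probability_bound(n, rate_I=0.5, delta=0.1)
        print(n, exact, bound)
```

Under this assumption $k_n p_n$ grows like $e^{n\delta}$, so the probability that no code-word lands in $S_n(\mathbf{X})$ decays doubly exponentially in $n$, which is why any $\delta > 0$ is enough.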

The  following  weak  version  of  the  direct  part  of  the  Source  Coding  Theorem  is  an  immediate 
consequence  of  Lemma  2.9.2. 


Lemma 2.9.3 For any $D > 0$ such that $R_1(D) < \infty$ and any $\delta > 0$ there exists a sequence of
codes with distortion at most $D$ and rates $R_{C_n} \le R_1(D) + \delta$.
Proof: Since $R_1(D) < \infty$ there exists a sequence of measures $Q^{(m)}$ such that $I_{Q^{(m)}} \to R_1(D)$
while $\rho_{Q^{(m)}} \le D$ and $Q_X^{(m)} = \mathcal{P}_1$. By applying Lemma 2.9.2 for $m = 1,2,\ldots$, it follows that
$\limsup_{n\to\infty} \bar\rho_n \le \limsup_{m\to\infty} \rho_{Q^{(m)}} \le D$ when $\mathcal{C}_n$ is the class of all codes of size $\lfloor e^{n(R_1(D)+\delta)} \rfloor$ and
$\delta > 0$ is arbitrarily small. The existence of a sequence of codes $C_n$ of rates $R_{C_n} \le R_1(D) + \delta$ and
of distortion $\limsup_{n\to\infty} \rho_{C_n} \le D$ is deduced by extracting the codes $C_n$ of minimal $\rho_{C_n}$ from the
ensembles $\mathcal{C}_n$. □

For i.i.d. source symbols, $R(D) = R_1(D)$ (see exercise 2.9.2) and the Direct Part of the Source
Coding Theorem amounts to Lemma 2.9.3. When the symbols are dependent one needs the
following extension of Lemma 2.9.2.

Lemma 2.9.4 Suppose $\mathcal{P}$ is ergodic with respect to the $J$-th shift operation (namely, it is ergodic
in blocks of size $J$). Then, for any probability measure $Q$ on $\Sigma^J \times \Sigma^J$ with $Q_X = \mathcal{P}_J$ and for any
$\delta > 0$ there exists a sequence of codes $C_n$ of rates $R_{C_n} \le \frac{1}{J} H(Q|Q_X \times Q_Y) + \delta$ and of distortion at
most $\rho_Q^{(J)}$.

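Proof: Consider the enlarged alphabet $\Sigma^J$ and regard each block of $J$ consecutive symbols of the
emitted source sequence as one symbol from $\Sigma^J$. A sketch of the proof which basically follows the
proofs of Lemmas 2.9.1 and 2.9.2 is presented here. Let

$$\Lambda^{(J)}(\theta) = \frac{1}{J}\int_{\Sigma^J} \log\Big(\int_{\Sigma^J} e^{J\theta\rho^{(J)}(x,y)}\,dQ_Y(y)\Big)\,d\mathcal{P}_J(x). \tag{2.9.232}$$

Then, by the ergodicity of $\mathcal{P}$ in blocks of size $J$

$$\Lambda^{(J)}(\theta) = \lim_{n\to\infty} \frac{1}{nJ} \log E\big[e^{\theta nJ Z_n^{(J)}(\mathbf{X})}\,\big|\,\mathbf{X}\big] \quad \text{almost surely } \mathcal{P}, \tag{2.9.233}$$

where $Z_n^{(J)}(\mathbf{x}) = \frac{1}{n}\sum_{j=1}^{n} \rho^{(J)}(x_j, Y_j)$ and $x_1, x_2, \ldots, x_n$ are concatenated symbols, namely, are
elements of $\Sigma^J$ (while $Y_1,\ldots,Y_n$ are i.i.d. random variables of law $Q_Y$ over $\Sigma^J$). Thus, Theorem
2.3.1 is once again applicable. To complete the proof, define $Q_\lambda$ now via

$$\frac{dQ_\lambda(x,y)}{dQ_X(x)\,dQ_Y(y)} = \frac{e^{J\lambda\rho^{(J)}(x,y)}}{\int_{\Sigma^J} e^{J\lambda\rho^{(J)}(x,z)}\,dQ_Y(z)},$$

and observe that for any $\lambda \in \mathbb{R}$ and any probability measure $Q$ over $\Sigma^J \times \Sigma^J$ with $Q_X = \mathcal{P}_J$

$$0 \le \frac{1}{J} H(Q|Q_\lambda) = \frac{1}{J} H(Q|Q_X \times Q_Y) - \lambda\rho_Q^{(J)} + \Lambda^{(J)}(\lambda).$$

Thus, $\frac{1}{J} H(Q|Q_X \times Q_Y) \ge \sup_{\lambda \in \mathbb{R}} \{\lambda\rho_Q^{(J)} - \Lambda^{(J)}(\lambda)\}$. □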

The following corollary is an immediate consequence of Lemma 2.9.4 (by adapting the proof of
Lemma 2.9.3 to $J \ne 1$).

Corollary 2.9.1 If $\mathcal{P}$ is ergodic in blocks of size $J$ for any $J \in \mathbb{Z}^+$ then the direct part of the
Source Coding Theorem holds.

While in general, an ergodic $\mathcal{P}$ might be non-ergodic in blocks, Corollary 2.9.1 holds true for any
stationary and ergodic $\mathcal{P}$ (see for example [2], pp. 278-280, [13], pp. 496-500). It clearly suffices
for that purpose to prove the following lemma.

Lemma 2.9.5 Lemma 2.9.4 holds true for any stationary and ergodic $\mathcal{P}$ and any $J \in \mathbb{Z}^+$.


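Proof: It is possible to show that when considering blocks of size $J$, the emitted infinite sequences of
the source may almost surely be divided into $J$ equally probable ergodic modes, $E_0, \ldots, E_{J-1}$, such
that if the sequence $(x_1, x_2, \ldots, x_n, \ldots)$ belongs to mode $E_i$ then $(x_{1+k}, x_{2+k}, \ldots, x_{n+k}, \ldots)$ belongs to
the mode $E_{(i+k) \bmod J}$ (see [2], [13]). This implies that $\mathcal{P} = \frac{1}{J}\sum_{i=0}^{J-1} \mathcal{P}^{(i)}$ where $\mathcal{P}^{(i)}$ are ergodic
in blocks of size $J$ and correspond to the ergodic modes $E_i$. By projections onto these ergodic modes,
each law $Q$ on $\Sigma^J \times \Sigma^J$ with $Q_X = \mathcal{P}_J$ is similarly decomposed into $Q = \frac{1}{J}\sum_{i=0}^{J-1} Q^{(i)}$, where $Q_X^{(i)}$
is the $J$-th marginal of $\mathcal{P}^{(i)}$ while $\rho_Q^{(J)} = \frac{1}{J}\sum_{i=0}^{J-1} \rho^{(J)}_{Q^{(i)}}$ and

$$H(Q|Q_X \times Q_Y) \ge \frac{1}{J}\sum_{i=0}^{J-1} H(Q^{(i)}|Q_X^{(i)} \times Q_Y^{(i)}). \tag{2.9.234}$$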


By applying Lemma 2.9.4 for $\mathcal{P}^{(i)}$ there exist $J$ sequences of codes $C_n^{(i)}$, $i = 0, \ldots, J-1$
with rates $R_{C_n^{(i)}} \le \frac{1}{J} H(Q^{(i)}|Q_X^{(i)} \times Q_Y^{(i)}) + \delta$ and distortions at most $\rho^{(J)}_{Q^{(i)}}$ with respect to the source
measures $\{\mathcal{P}^{(i)}\}_{i=0}^{J-1}$, respectively. A sequence of codes $C^*_{(n+1)J}$ of rates

$$R_{C^*_{(n+1)J}} \le \frac{1}{J}\sum_{i=0}^{J-1} R_{C_n^{(i)}} + \frac{1}{(n+1)J} \log J \tag{2.9.235}$$

and of distortion at most $\rho_Q^{(J)}$ with respect to $\mathcal{P}$ is now constructed as follows. The code $C^*_{(n+1)J}$ is
the union of $J$ codes $C^{(i)*}_{(n+1)J}$, each of cardinality $\prod_{k=0}^{J-1} |C_n^{(k)}|$ and length $(n+1)J$. The code
$C^{(i)*}_{(n+1)J}$ consists of all the distinct words in
$\{(y_0, a^*, y_1, a^*, \ldots, y_{J-1}, a^*) : y_k \in C_n^{((i+k) \bmod J)},\ k = 0, \ldots, J-1\}$ and $a^* \in \Sigma$ is a fixed separator
symbol. This construction guarantees that the sequence of codes $C^{(i)*}_{(n+1)J}$ has distortion at most
$\rho_Q^{(J)}$ with respect to the source measure $\mathcal{P}^{(i)}$ and thus $C^*_{(n+1)J}$ has at most this distortion with
respect to the source measure $\mathcal{P}$. By (2.9.234) and (2.9.235) for all $n$ large enough,
$R_{C^*_{(n+1)J}} \le \frac{1}{J} H(Q|Q_X \times Q_Y) + 2\delta$. Finally, for code length which is not an integer multiple of $J$, one may
easily modify the code $C^*_{(n+1)J}$ of closest length while neither affecting the limiting distortion per
symbol nor the limiting rate (as $n \to \infty$). The proof is thus complete. □

The last lemma in this section is devoted to the converse part of the Source Coding Theorem,
whose proof is based on information theoretical arguments and not on large deviations bounds.

Lemma 2.9.6 For any sequence of codes of distortion $D$ and for any $\delta > 0$

$$\liminf_{n\to\infty} R_{C_n} \ge R(D + \delta).$$


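Proof: It suffices to consider codes $C_n$ of finite rates and of distortion $D$. Such a code $C_n$ is a
mapping from $\Sigma^n$ to $\Sigma^n$. When its domain $\Sigma^n$ is equipped with the probability measure $\mathcal{P}_n$ (the
$n$-th marginal of the source measure $\mathcal{P}$), $C_n$ induces a (degenerate) joint measure $Q^{(n)}$ on $\Sigma^n \times \Sigma^n$.
Note that $Q_X^{(n)} = \mathcal{P}_n$ and $\rho_{Q^{(n)}} = \rho_{C_n} \le D + \delta$ for all $n$ large enough (and any $\delta > 0$). Therefore,
$H(Q^{(n)}|Q_X^{(n)} \times Q_Y^{(n)}) \ge nR_n(D + \delta) \ge nR(D + \delta)$ for any $\delta > 0$ and any $n$ large enough. Since the
marginals $Q_Y^{(n)}$ have the finite support sets $C_n$

$$nR_{C_n} = \log |C_n| \ge H(Q_Y^{(n)}),$$

where the entropy is defined as $H(Q_Y^{(n)}) = \sum_{i=1}^{|C_n|} Q_Y^{(n)}(y_i) \log \frac{1}{Q_Y^{(n)}(y_i)}$ (compare with the
definition in Section 2.1.1). Let $f_n(x,y_i) = \frac{dQ^{(n)}(x,y_i)}{dQ_X^{(n)}(x)\,dQ_Y^{(n)}(y_i)}$. Then, $f_n : \Sigma^n \times \Sigma^n \to [0,\infty)$ is well
defined, $\int_{\Sigma^n} f_n(x,y_i)\,dQ_X^{(n)}(x) = 1$ as well as $\sum_{i=1}^{|C_n|} f_n(x,y_i)\,Q_Y^{(n)}(y_i) = 1$ almost surely
$Q_X^{(n)}$. Thus, $f_n(x,y_i)\,Q_Y^{(n)}(y_i) \le 1$ almost surely $Q_X^{(n)}$ and

$$H(Q_Y^{(n)}) - H(Q^{(n)}|Q_X^{(n)} \times Q_Y^{(n)}) = \sum_{i=1}^{|C_n|} Q_Y^{(n)}(y_i) \Big[\log \frac{1}{Q_Y^{(n)}(y_i)} - \int_{\Sigma^n} f_n(x,y_i) \log f_n(x,y_i)\,dQ_X^{(n)}(x)\Big]$$
$$= \int_{\Sigma^n} \sum_{i=1}^{|C_n|} f_n(x,y_i)\,Q_Y^{(n)}(y_i) \log \Big[\frac{1}{f_n(x,y_i)\,Q_Y^{(n)}(y_i)}\Big]\,dQ_X^{(n)}(x) \ge 0. \tag{2.9.236}$$

Therefore,

$$\liminf_{n\to\infty} R_{C_n} \ge \liminf_{n\to\infty} \frac{1}{n} H(Q_Y^{(n)}) \ge \liminf_{n\to\infty} \frac{1}{n} H(Q^{(n)}|Q_X^{(n)} \times Q_Y^{(n)}) \ge R(D + \delta).$$ □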
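The chain $\log |C_n| \ge H(Q_Y^{(n)}) \ge H(Q^{(n)}|Q_X^{(n)} \times Q_Y^{(n)})$ can be checked on a toy example. The Python sketch below computes the three quantities for a deterministic code on a small finite alphabet. The source probabilities and the code map are hypothetical, and $|C_n|$ is taken to be the number of code-words actually used. For such a degenerate joint law the product $f_n Q_Y^{(n)}$ takes only the values 0 and 1, so the last two quantities coincide in this example.

```python
import math
from collections import defaultdict

def code_information(source_probs, code):
    """source_probs: dict block -> probability; code: dict block -> code-word.

    Returns (log |C|, H(Q_Y), H(Q | Q_X x Q_Y)) in nats, where Q is the
    joint law of (X, C(X)) induced by the deterministic code.
    """
    q_y = defaultdict(float)
    for x, p in source_probs.items():
        q_y[code[x]] += p
    log_size = math.log(len(set(code.values())))
    entropy_y = -sum(p * math.log(p) for p in q_y.values() if p > 0)
    # Q puts mass P(x) on (x, C(x)), so dQ/d(Q_X x Q_Y) = 1/Q_Y(C(x)) there.
    relative_entropy = sum(p * math.log(1.0 / q_y[code[x]])
                           for x, p in source_probs.items() if p > 0)
    return log_size, entropy_y, relative_entropy

if __name__ == "__main__":
    # Hypothetical source on blocks of two binary symbols and a two-word code.
    probs = {"00": 0.4, "01": 0.3, "10": 0.2, "11": 0.1}
    code = {"00": "00", "01": "00", "10": "11", "11": "11"}
    print(code_information(probs, code))
```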
Exercises:


2.9.1 (a). Prove that when $\Sigma$ is a finite set and $R_1(D) > 0$, there exists a probability measure
$Q$ on $\Sigma \times \Sigma$ for which $I_Q = R_1(D)$, $Q_X = \mathcal{P}_1$ and $\rho_Q = D$.

(b). Prove that for this measure also $\Lambda^*(\rho_Q) = I_Q$.

2.9.2 Prove that when $\mathcal{P}$ is a product measure (namely, the emitted symbols $X_1, X_2, \ldots, X_n$ are
i.i.d.) then $R_J(D) = R_1(D)$ for all $J \in \mathbb{Z}^+$.

2.9.3 (a). Show that $(m + n)R_{m+n}(D) \le mR_m(D) + nR_n(D)$ for any two integers $m, n$.

(b). Conclude that if $\limsup_{J\to\infty} R_J(D) < \infty$ then $R(D) = \lim_{J\to\infty} R_J(D)$.


2.10 Refinements of large deviations statements in $\mathbb{R}^d$

Cramer's theorem deals with the tails of the empirical means $S_n$. On a finer scale, at least in the i.i.d.
case, the random variables $\sqrt{n} S_n$ possess a limiting Normal distribution by the classical Central
Limit Theorem. In this situation, the scaled empirical means $n^\beta S_n$ satisfy a large deviations principle for
any $\beta \in (0, \frac{1}{2})$, but always with a quadratic (Normal-like) rate function. This statement is made
precise in the following theorem.

Theorem 2.10.1 Let $X_1, \ldots, X_n$ be a sequence of $\mathbb{R}^d$ valued i.i.d. random variables with $E(X_1) = 0$
such that $\Lambda_X(\lambda) = \log E(e^{\langle \lambda, X_1 \rangle}) < \infty$ in some open ball $B_{0,\delta}$ around the origin. Fix $\beta \in (0, \frac{1}{2})$
and let $Z_n = n^{\beta-1} \sum_{i=1}^{n} X_i = n^\beta S_n$. Then, $Z_n$ satisfy a large deviations principle in $\mathbb{R}^d$ governed
by the good rate function

$$I_G(x) = \frac{1}{2} \langle x, C^{-1} x \rangle, \tag{2.10.237}$$

where $C$ is the covariance matrix of $X_1$. Moreover, any open or closed set $G$ is an $I_G$ continuity
set.

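Proof: This theorem follows from the general large deviations statement of Section 2.3 with
$a_n = n^{-(1-2\beta)}$. Indeed, in the notations of Section 2.3

$$\Lambda_n(a_n^{-1}\lambda) = \log E\big(e^{a_n^{-1}\langle \lambda, Z_n \rangle}\big) = \sum_{i=1}^{n} \log E\big(e^{n^{-\beta}\langle \lambda, X_i \rangle}\big) = n \log E\big(e^{n^{-\beta}\langle \lambda, X_1 \rangle}\big). \tag{2.10.238}$$

Therefore

$$\Lambda(\lambda) = \lim_{n\to\infty} n^{2\beta} \log E\big(e^{n^{-\beta}\langle \lambda, X_1 \rangle}\big). \tag{2.10.239}$$

Note that $n^{-\beta} \to 0$ as $n \to \infty$ and therefore, by our assumption that $\Lambda_X(\lambda) < \infty$ in an open ball around 0,
for each $\lambda \in \mathbb{R}^d$ there exists an $n_0$ large enough such that for all $n \ge n_0$, $E\big(e^{n^{-\beta}\langle \lambda, X_1 \rangle}\big) < \infty$,
and by dominated convergence

$$E\big(e^{n^{-\beta}\langle \lambda, X_1 \rangle}\big) = 1 + n^{-\beta} E[\langle \lambda, X_1 \rangle] + \frac{1}{2} n^{-2\beta} E[\langle \lambda, X_1 \rangle^2] + O(n^{-3\beta}). \tag{2.10.240}$$

Substituting (2.10.240) into (2.10.239) one obtains (using the identity $E[\langle \lambda, X_1 \rangle] = 0$)

$$\Lambda(\lambda) = \lim_{n\to\infty} n^{2\beta} \log \Big\{1 + \frac{1}{2} n^{-2\beta} E[\langle \lambda, X_1 \rangle^2] + O(n^{-3\beta})\Big\} = \frac{1}{2} E[\langle \lambda, X_1 \rangle^2] = \frac{1}{2} \langle \lambda, C\lambda \rangle. \tag{2.10.241}$$

Thus,

$$\Lambda^*(x) = \sup_{\lambda \in \mathbb{R}^d} \{\langle \lambda, x \rangle - \Lambda(\lambda)\} = \sup_{\lambda \in \mathbb{R}^d} \Big\{\langle \lambda, x \rangle - \frac{1}{2} \langle \lambda, C\lambda \rangle\Big\} = \frac{1}{2} \langle x, C^{-1} x \rangle = I_G(x). \tag{2.10.242}$$

Since $\Lambda(\lambda) = \frac{1}{2} \langle \lambda, C\lambda \rangle$ is differentiable and finite everywhere Theorem 2.3.1 applies and the
proof is thus complete. □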
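The two computations in this proof can be checked numerically in dimension $d = 1$. The Python sketch below assumes, purely for illustration, that $X_1 = \pm 1$ with probability $\frac{1}{2}$ each (so $C = 1$ and $E(e^{tX_1}) = \cosh t$), and approximates the Legendre transform of a quadratic on a finite grid. The function names, the grid and the numerical values are hypothetical choices.

```python
import math

def scaled_log_mgf(lam, n, beta):
    """n^{2*beta} * log E exp(n^{-beta} * lam * X_1) for X_1 = +/-1 w.p. 1/2."""
    t = lam * n ** (-beta)
    return n ** (2 * beta) * math.log(math.cosh(t))

def legendre_quadratic(x, c, grid=20001, span=10.0):
    """Grid approximation of sup over lam in [-span, span] of lam*x - 0.5*c*lam^2."""
    best = -math.inf
    for i in range(grid):
        lam = -span + 2 * span * i / (grid - 1)
        best = max(best, lam * x - 0.5 * c * lam * lam)
    return best

if __name__ == "__main__":
    lam, beta = 1.5, 0.25
    for n in (10**2, 10**4, 10**8):
        # should approach lam^2 / 2 as n grows, cf. (2.10.241)
        print(n, scaled_log_mgf(lam, n, beta), 0.5 * lam * lam)
    # should be close to x^2 / (2c), cf. (2.10.242) with d = 1
    print(legendre_quadratic(0.8, 2.0), 0.8 ** 2 / (2 * 2.0))
```

In one dimension $\langle x, C^{-1}x \rangle$ reduces to $x^2/c$, which is what the grid search reproduces.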


Remarks:

(a). Note that Theorem 2.10.1 is nothing more than what one would obtain from a naive Taylor
expansion applied on the ansatz $\mathrm{Prob}(S_n = x) \approx e^{-nI(x)}$ where $I(\cdot)$ is the rate function of Theorem
2.3.2 (see Section 2.3).

(b). The rate of convergence in (2.10.241) is of $O(n^{-\beta})$, suggesting a similar convergence rate for
$a_n \log \mathrm{Prob}(Z_n \in G)$ (which converges to $-\inf_{x \in G} I_G(x)$ for any open $G$). Indeed, such a result is
proved in [12] (pp. 552-553) for $d = 1$ and $G = (x,\infty)$. This extends for example the validity of
the Normal approximation for the distribution of $\sqrt{n} S_n$ to intervals of $o(n^{1/6})$ (which correspond to
$\beta > \frac{1}{3}$ here).

(c). A similar result may be obtained in the context of Markov additive processes (see Section
2.4.1).

Another refinement of Cramer's theorem involves a more accurate estimate of the laws $\mu_n$ of $S_n$
(the empirical means of i.i.d. random variables). Specifically, for a "nice" $I$ continuity set $A$ one
seeks an estimate $J_n^{-1}$ of $\mu_n(A)$ such that $\lim_{n\to\infty} J_n \mu_n(A) = 1$. Such an estimate is an improvement
over the normalized logarithmic limit $\frac{1}{n} \log \mu_n(A)$ implied by a large deviations principle. The
following theorem deals with the estimate $J_n$ for certain half intervals $A = [q,\infty) \subset \mathbb{R}$.

Theorem 2.10.2 (Bahadur and Rao) Let $\mu_n$ denote the law of $S_n = \frac{1}{n} \sum_{i=1}^{n} X_i$ where $X_i$ are
i.i.d. real valued random variables and $\Lambda(\lambda) = \log E(e^{\lambda X_1})$ is the logarithmic moment generating
function of $X_1$. Consider the set $A = [q,\infty)$ where $q \in X$, namely $q = \Lambda'(\eta)$ for some positive
$\eta \in \mathcal{D}_\Lambda^o$.

(a). If the law of $X_1$ is non-lattice, then

$$\lim_{n\to\infty} J_n \mu_n(A) = 1 \tag{2.10.243}$$

where $J_n = \eta \sqrt{\Lambda''(\eta)\, 2\pi n}\; e^{n\Lambda^*(q)}$.

(b). Suppose $X_1$ has a lattice law, namely, for some finite $x_0$, $d$, the random variable $\frac{1}{d}(X_1 - x_0)$
is with probability one an integer value. Assume further that $1 > \mathrm{Prob}(X_1 = q) > 0$ (in particular,
this implies that $\frac{1}{d}(q - x_0)$ is an integer and that $\Lambda''(\eta) > 0$). Then,

$$\lim_{n\to\infty} J_n \mu_n(A) = \frac{\eta d}{1 - e^{-\eta d}}. \tag{2.10.244}$$

Remarks: (a). Recall that $\Lambda^*(q) = \eta q - \Lambda(\eta)$ and $\Lambda(\cdot)$ is $C^\infty$ in some open neighborhood of $\eta$ by
the dominated convergence argument (for details see Lemma 2.2.1 and exercise 2.2.5).

(b). Actually the limit relations (2.10.243) and (2.10.244) remain valid even for small intervals of
size of $O(\frac{\log n}{n})$ (see exercise 2.10.1).

(c). The proof of this theorem is based on an exponential translation of a local Central Limit
Theorem. This approach is applicable for the dependent case of Section 2.3 and to certain extent
applies also in $\mathbb{R}^d$, $d > 1$.
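As a sanity check of the non-lattice limit (2.10.243), the following Python sketch evaluates $J_n \mu_n(A)$ in a case where the tail is available in closed form. It assumes, purely for illustration, that $X_i$ are standard Normal, so that $\Lambda(\lambda) = \lambda^2/2$, $\eta = q$, $\Lambda''(\eta) = 1$, $\Lambda^*(q) = q^2/2$ and $S_n$ is Normal with mean 0 and variance $1/n$. The function name and the values of $q$ and $n$ are hypothetical.

```python
import math

def bahadur_rao_ratio_gaussian(q, n):
    """J_n * mu_n([q, inf)) for X_i ~ N(0,1), with eta = q and Lambda''(eta) = 1."""
    # P(S_n >= q) for S_n ~ N(0, 1/n), written with the complementary error function
    tail = 0.5 * math.erfc(q * math.sqrt(n) / math.sqrt(2.0))
    j_n = q * math.sqrt(2.0 * math.pi * n) * math.exp(n * q * q / 2.0)
    return j_n * tail

if __name__ == "__main__":
    for n in (25, 100, 400):
        print(n, bahadur_rao_ratio_gaussian(1.0, n))
```

In this Gaussian case the ratio approaches 1 from below, in line with the classical expansion of the Normal tail.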
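Proof: (a). Consider the probability measure $\tilde\mu$ defined by $d\tilde\mu(x) = e^{\eta x - \Lambda(\eta)}\,d\mu(x)$ (with $\mu$ the
law of $X_1$) and let $Y_i = (X_i - q)/\sqrt{\Lambda''(\eta)}$, for $i = 1, 2, \ldots, n$. Note that $Y_1,\ldots,Y_n$ are i.i.d. random
variables with $E_{\tilde\mu}[Y_1] = 0$ and $E_{\tilde\mu}[Y_1^2] = 1$ (this can be easily checked by computing the first two
moments of $X_1$ under $\tilde\mu$). Let $F_n(x)$ denote the distribution function of $W_n = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} Y_i$ under the
measure $\tilde\mu$.

Since $X_i$ are non-lattice, the Berry-Esseen expansion of $F_n(x)$ results with:

$$\lim_{n\to\infty} \Big\{\sqrt{n}\, \sup_x \Big| F_n(x) - \Phi(x) - \frac{m_3}{6\sqrt{n}} (1 - x^2)\phi(x) \Big| \Big\} = 0, \tag{2.10.245}$$

where $m_3 = E_{\tilde\mu}[Y_1^3] < \infty$, $\phi(x) = \frac{1}{\sqrt{2\pi}} \exp(-x^2/2)$ is the standard Normal density, and
$\Phi(x) = \int_{-\infty}^{x} \phi(v)\,dv$ (for the derivation of (2.10.245) see [12] page 512).

Now,

$$\mu_n(A) = \mu_n([q,\infty)) = E_{\tilde\mu}\big[e^{-n[\eta S_n - \Lambda(\eta)]}\, 1_{\{S_n \ge q\}}\big] = e^{-n\Lambda^*(q)} \int_0^\infty e^{-\eta\sqrt{n\Lambda''(\eta)}\,x}\,dF_n(x) \tag{2.10.246}$$

since $S_n = q + \sqrt{\Lambda''(\eta)/n}\; W_n$. Let $\psi_n = \eta\sqrt{n\Lambda''(\eta)}$. By an integration by parts in (2.10.246) one
obtains

$$J_n \mu_n(A) = \sqrt{2\pi} \int_0^\infty \psi_n^2 e^{-\psi_n x} [F_n(x) - F_n(0)]\,dx = \sqrt{2\pi} \int_0^\infty \psi_n e^{-t} \Big[F_n\Big(\frac{t}{\psi_n}\Big) - F_n(0)\Big]\,dt. \tag{2.10.247}$$

Consider now

$$c_n = \sqrt{2\pi} \int_0^\infty \psi_n e^{-t} \Big[\Phi\Big(\frac{t}{\psi_n}\Big) + \frac{m_3}{6\sqrt{n}} \Big(1 - \frac{t^2}{\psi_n^2}\Big) \phi\Big(\frac{t}{\psi_n}\Big) - \Phi(0) - \frac{m_3}{6\sqrt{n}} \phi(0)\Big]\,dt. \tag{2.10.248}$$

Comparing (2.10.247) and (2.10.248), observe that the Berry-Esseen expansion (2.10.245) yields
the relation $\lim_{n\to\infty} |J_n \mu_n(A) - c_n| = 0$. Moreover, since

$$\sup_{x \ge 0} |\phi'(x)| < \infty, \qquad \lim_{x \to 0} |\phi'(x)| = 0 \tag{2.10.249}$$

it follows by a Taylor expansion of $\Phi$ and the dominated convergence theorem that

$$\lim_{n\to\infty} c_n = \lim_{n\to\infty} \sqrt{2\pi} \int_0^\infty \psi_n e^{-t} \Big[\Phi\Big(\frac{t}{\psi_n}\Big) - \Phi(0)\Big]\,dt
= \lim_{n\to\infty} \sqrt{2\pi} \int_0^\infty t e^{-t}\, \frac{\Phi(t/\psi_n) - \Phi(0)}{t/\psi_n}\,dt = \sqrt{2\pi}\,\phi(0) \int_0^\infty t e^{-t}\,dt = 1. \tag{2.10.250}$$

This completes the proof for the non-lattice case.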

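(b). In the lattice case (where the range of $Y_i$ is $\big\{\frac{md}{\sqrt{\Lambda''(\eta)}} : m \in \mathbb{Z}\big\}$) the Berry-Esseen expansion
(2.10.245) is modified to

$$\lim_{n\to\infty} \Big\{\sqrt{n}\, \sup_x \Big| F_n(x) - \Phi(x) - \frac{m_3}{6\sqrt{n}} (1 - x^2)\phi(x) - \phi(x)\, g\Big(x, \frac{d}{\sqrt{\Lambda''(\eta)n}}\Big) \Big| \Big\} = 0 \tag{2.10.251}$$

where $g(x,h) = \frac{h}{2} - (x \bmod h)$ if $(x \bmod h) \ne 0$ and $g(x,h) = -\frac{h}{2}$ if $(x \bmod h) = 0$ (see [12],
page 513 or [22], page 171, Theorem 6). Thus, by adopting the argument above for the lattice case,
one obtains

$$\lim_{n\to\infty} J_n \mu_n(A) = 1 + \lim_{n\to\infty} \sqrt{2\pi} \int_0^\infty \psi_n e^{-t} \Big[\phi\Big(\frac{t}{\psi_n}\Big) g\Big(\frac{t}{\psi_n}, \frac{d}{\sqrt{\Lambda''(\eta)n}}\Big) - \phi(0)\, g\Big(0, \frac{d}{\sqrt{\Lambda''(\eta)n}}\Big)\Big]\,dt. \tag{2.10.252}$$

Since $\psi_n\, g\big(\frac{t}{\psi_n}, \frac{d}{\sqrt{\Lambda''(\eta)n}}\big) = g(t, \eta d)$ it follows that

$$\lim_{n\to\infty} J_n \mu_n(A) = 1 + \lim_{n\to\infty} \sqrt{2\pi} \int_0^\infty e^{-t} \Big[\phi\Big(\frac{t}{\psi_n}\Big) g(t, \eta d) - \phi(0)\, g(0, \eta d)\Big]\,dt
= 1 + \int_0^\infty e^{-t} [g(t, \eta d) - g(0, \eta d)]\,dt. \tag{2.10.253}$$

The proof is completed by combining (2.10.253) with

$$\int_0^\infty e^{-t} [g(t, \eta d) - g(0, \eta d)]\,dt = \sum_{k=0}^{\infty} e^{-k\eta d} \int_0^{\eta d} e^{-t} (\eta d - t)\,dt = \frac{\eta d}{1 - e^{-\eta d}} - 1. \tag{2.10.254}$$

□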
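The lattice limit (2.10.244) can also be checked by exact computation. The Python sketch below assumes, purely for illustration, that $X_i \in \{-1, 0, +1\}$ with hypothetical probabilities $(c, b, a) = (0.5, 0.3, 0.2)$, so that $d = 1$, $q = 0$ is a lattice point with $1 > \mathrm{Prob}(X_1 = 0) > 0$, and $\Lambda'(\eta) = 0$ has the positive solution $\eta = \frac{1}{2}\log(c/a)$. The law of $X_1 + \cdots + X_n$ is obtained by repeated convolution.

```python
import math

def lattice_bahadur_rao(n, a=0.2, b=0.3, c=0.5):
    """Return (J_n * mu_n([0, inf)), eta*d / (1 - exp(-eta*d))) with d = 1, q = 0,
    for X_i in {-1, 0, +1} with probabilities (c, b, a), c > a."""
    eta = 0.5 * math.log(c / a)        # solves Lambda'(eta) = 0
    m = b + 2.0 * math.sqrt(a * c)     # exp(Lambda(eta))
    var = 2.0 * math.sqrt(a * c) / m   # Lambda''(eta)
    # Exact law of the partial sums; after k steps dist[i] = P(sum = i - k).
    dist = [1.0]
    for _ in range(n):
        new = [0.0] * (len(dist) + 2)
        for i, p in enumerate(dist):
            new[i] += c * p            # step -1
            new[i + 1] += b * p        # step 0
            new[i + 2] += a * p        # step +1
        dist = new
    tail = sum(dist[n:])               # P(S_n >= 0)
    # Lambda^*(0) = -Lambda(eta) = -log m, hence exp(n * Lambda^*(0)) = m ** (-n)
    j_n = eta * math.sqrt(var * 2.0 * math.pi * n) * m ** (-n)
    return j_n * tail, eta / (1.0 - math.exp(-eta))

if __name__ == "__main__":
    for n in (50, 200, 400):
        print(n, lattice_bahadur_rao(n))
```

The limiting constant exceeds 1 here, which reflects the atom that the lattice law places exactly at the boundary point $q$.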


Exercises:

2.10.1 (a). Let $A = [q, q + \frac{a}{n})$, where in the lattice case $\frac{a}{d}$ is restricted to be an integer. Prove that
for any $a \in (0,\infty)$, both (2.10.243) and (2.10.244) hold with $J_n = \eta \sqrt{\Lambda''(\eta)\, 2\pi n}\; e^{n\Lambda^*(q)} / (1 - e^{-\eta a})$.

(b). As a consequence of part (a) above conclude that for any set $A = [q, q + b_n)$ both (2.10.243)
and (2.10.244) hold for $J_n$ as given in Theorem 2.10.2 as long as $\lim_{n\to\infty} n b_n = \infty$.

2.10.2 (a). Let $\eta > 0$ denote the minimizer of $\Lambda(\lambda)$ and suppose that $\Lambda(\lambda) < \infty$ in some open
interval around $\eta$. Based on exercise 2.10.1, deduce that the limiting distribution of $nS_n$ conditional
upon $S_n \ge 0$ is Exponential($\eta$) when $X_1$ has a non-lattice distribution.

(b). Suppose now that $X_1$ has a lattice distribution of span $d$ and $1 > \mathrm{Prob}(X_1 = 0) > 0$.
Deduce now that the limiting distribution of $nS_n/d$ conditional upon $S_n \ge 0$ is Geometric($p$) with
$p = 1 - e^{-\eta d}$ (i.e., $\mathrm{Prob}(nS_n = kd \,|\, S_n \ge 0) \to p(1-p)^k$ for $k = 0, 1, 2, \ldots$).
2.10.3 Consider a Neyman-Pearson test with constant threshold $\gamma \in (x_0, x_1)$ (see Section 2.7
for details). Suppose that $X_1$ has a non-lattice distribution. Let $\lambda_\gamma \in (0,1)$ be the
unique solution of $\Lambda_0'(\lambda) = \gamma$. Deduce from (2.10.243) that

^1^  27rn|  =  1  (2.10.255) 


and 


(2.10.256) 




Chapter  3 


Historical  notes  and  references 


Although  much  of  the  credit  for  the  modern  theory  of  large  deviations  and  its  various  applications 
must  go  to  Donsker  and  Varadhan,  the  topic  is  much  older  and  references  to  the  various  aspects 
of  it  may  be  traced  back  to  the  early  1900’s.  Due  to  our  own  ignorance  we  will  necessarily  confine 
ourselves  here  to  an  incomplete  list  of  references  and  historical  credits.  We  hope  to  expand  and 
correct  this  list  at  a  later  stage,  and  apologize  to  those  who  are  not  given  due  credit. 


3.1  Chapter  2 

The early development of large deviation bounds did not follow the order of our presentation.
Statisticians, starting with Khinchin [18], have analysed various forms of Cramer's theorem for
special random variables. See [27], [20] and [21] for additional references on this early work.

The  first  statement  of  Cramer’s  Theorem  for  distributions  on  R  possessing  densities  is  due 
to  Cramer  [7],  who  introduced  the  change  of  measure  argument  to  this  context.  An  extension  to 
general  distributions  was  done  by  Chernoff  [6],  who  introduced  the  upper  bound  which  was  to  carry 
his  name.  There  exists  a  large  body  of  literature  concerning  the  applications  of  Cramer’s  theorem 
to  the  analysis  of  statistical  tests,  on  which  we  keep  silent. 

Although Stirling's formula, which is at the heart of the combinatorial estimates of section 2.1,
dates back at least to the 19-th century, the notion of types and bounds of the form of Lemmas
2.1.1-2.1.4 had to wait until information theorists discovered that they are useful tools for analyzing
the efficiency of codes. For early references we refer the reader to the excellent book of Gallager
[12]. Our treatment of the source coding theorem in section 2.9 is a combination of the method of
that book with the particular case treated by Bucklew in [4].

The credit for the extension of Cramer's theorem to the dependent case should definitely go to
Gartner [14] who considered the case in which $\mathcal{D}_\Lambda = \mathbb{R}^d$. Ellis [10] extended this result to the steep
set up and the formulation of section 2.3 is nothing but an embellishment of his results.

The large deviations statements for Markov chains have a long history which is partially
described in the historical notes of Chapter ??. The approach taken here is based in part on the ideas
of Ellis [10].

The  material  in  section  2.5  is  a  large  deviations  proof  of  the  results  in  [22]. 

Gibbs' conditioning principle has served as a driving force behind Ruelle and Lanford's
treatment of large deviations (without calling it by that name), [24], [25], [19]. The form of the Gibbs
principle here was proved using large deviations methods (via the method of types) by Van Campenhout
and Cover [5] and in greater generality by Csiszar [8] and Stroock-Zeitouni [28].

No  references  yet  for  Stein’s  lemma. 

The generalized maximum likelihood of section 2.8 was considered by Hoeffding [16], whose
approach we basically follow here. The extension to general state space presented in section ?? is
due to Zeitouni and Gutman [30].

Finally,  the  refinements  of  the  large  deviations  principles  discussed  in  sections  2.10  and  ?? 
follow  [1],  [21],  although  some  of  the  methods  are  much  older  and  may  be  found  in  Feller’s  book 
[11]. 




Bibliography 


[1] R.R. Bahadur and R. Ranga Rao. On deviations of the sample mean. Ann. Math. Statist.,
31:1015-1027, 1960.

[2]  T.  Berger.  Rate  Distortion  Theory.  Prentice  Hall,  1971. 

[3]  P.  Billingsley.  Convergence  of  Probability  Measures.  Wiley,  1968. 

[4]  J.  A.  Bucklew.  Large  deviations  techniques  in  decision,  simulation,  and  estimation.  Springer, 
1991. 

[5] J.M. Van Campenhout and T.M. Cover. Maximum entropy and conditional probability. IEEE
Trans. Inf. Theory, IT-27:483-489, 1981.

[6] H. Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of
observations. Ann. Math. Statist., 23:493-507, 1952.

[7]  H.  Cramer.  Sur  un  nouveau  theoreme-limite  de  la  theorie  des  probabilites.  In  Actualites 
Scientifiques  et  Industrielles,  volume  3  of  Colloque  consacre  a  la  theorie  des  probabilites,  pages 
5-23.  Hermann,  Paris,  1938. 

[8]  I.  Csiszar.  I-divergence  geometry  of  probability  distributions  and  minimizations  problems. 
Ann.  Probab.,  3:146-158,  1975. 

[9] I. Ekeland and R. Temam. Convex Analysis and Variational Problems. North Holland,
Amsterdam, 1976.

[10]  R.  S.  Ellis.  Large  deviations  for  a  general  class  of  random  vectors.  Ann.  Probab.,  12:1-12, 
1984. 




[11] W. Feller. An Introduction to Probability Theory and Its Applications, volume 1. John Wiley
and Sons, New York, 1957.

[12]  R.G.  Gallager.  Information  Theory  and  Reliable  Communication.  Wiley,  New  York,  1968. 

[13]  F.G.  Gantmacher.  The  Theory  of  Matrices.  Chelsea,  1959. 

[14] J. Gartner. On large deviations from the invariant measure. Theory Probab. Appl., 22:24-39,
1977.

[15] U. Grenander and G. Szego. Toeplitz Forms and their Applications. University of California
Press, 1958.

[16] W. Hoeffding. On probabilities of large deviations. In Proceedings of the Fifth Berkeley
Symposium on Mathematical Statistics and Probability, pages 203-219. Univ. of California
Press, 1965.

[17] I. Csiszar, T.M. Cover and B.S. Choi. Conditional limit theorems under Markov conditioning.
IEEE Trans. Inf. Theory, IT-33:788-801, 1987.

[18] A.I. Khinchin. Über einen neuen Grenzwertsatz der Wahrscheinlichkeitsrechnung. Math.
Annalen, 101:745-752, 1929.

[19] O.E. Lanford. Entropy and equilibrium states in classical statistical mechanics. In A. Lenard,
editor, Statistical Mechanics and Mathematical Problems, volume 20 of Lecture Notes in
Physics, pages 1-113. Springer, Berlin, 1973.

[20]  Y.  V.  Linnik.  On  the  probability  of  large  deviations  for  the  sums  of  independent  variables.  In 
Proceedings  of  the  Fourth  Berkeley  Symposium  on  Mathematical  Statistics  and  Probability, 
pages  289-306,  Berkeley,  1961.  Univ.  of  California  Press. 

[21]  V.  V.  Petrov.  Sums  of  independent  random  variables.  Springer,  Berlin,  1975.  Translated  by 
A.  A.  Brown. 

[22] R. Arratia, L. Gordon and M.S. Waterman. The Erdos-Renyi law in distribution for coin
tossing and sequence matching. Ann. of Statistics, 1990.




[23]  R.  T.  Rockafellar.  Convex  Analysis.  Princeton  University  Press,  Princeton,  1970. 

[24] D. Ruelle. Correlation functionals. J. Math. Physics, 6:201-220, 1965.

[25]  D.  Ruelle.  A  variational  formulation  of  equilibrium  statistical  mechanics  and  the  Gibbs  phase 
rule.  Comm.  math.  Phys.,  5:324-329,  1967. 

[26]  E.  Seneta.  Non  Negative  Matrices  and  Markov  Chains.  Springer- Verlag,  1981. 

[27] N. Smirnoff. Über Wahrscheinlichkeiten grosser Abweichungen. Rec. Soc. Math. Moscou,
40:441-455, 1933.

[28] D.W. Stroock and O. Zeitouni. Micro canonical distributions, Gibbs' states, and the
equivalence of ensembles. In R. Durrett and H. Kesten, editors, Festschrift in honour of F. Spitzer.
Birkhauser, 1990.

[29]  T.M.  Cover  and  J.B.  Thomas.  Elements  of  Information  Theory,  forthcoming,  1991. 

[30] O. Zeitouni and M. Gutman. On universal hypotheses testing via large deviations. IEEE
Trans. Inf. Theory, 1991.




